Preface


Features of the book The writing of this book was prompted by the surge of 
research activities in data science, and the role of information theory in the field. 
This forms the motivation for this book, enabling three key features. 

The first feature is the demonstration of principles and tools of information 
theory in the context of data science applications, such as social networks, DNA 
sequencing, search engines, and artificial intelligence (AI). Information theory is a
fundamental field that has made foundational impacts upon a wide spectrum of
domains in science and engineering. It was established by Claude Shannon in 1948 
and deals with mathematical laws that govern the flow, representation and transmis- 
sion of information. The most significant achievement of the field is the invention 
of digital communication which forms the basis of our daily-life digital products 
such as smart phones, laptops and Internet of Things (IoT) devices. While the field 
was founded in communication, it has since expanded beyond its original domain, 
contributing to a widening array of contexts, including networks, computational 
biology, quantum science, economics, finance, and even gambling. Therefore, sev- 
eral books on information theory have been published over the past few decades, 
covering a broad range of subjects (Gallager, 1968; Cover, 1999; MacKay, 2003; 
Yeung, 2008; Csiszár and Körner, 2011; El Gamal and Kim, 2011; Gray, 2011; 
Gleick, 2011; Pierce, 2012; Wilde, 2013). However, this book focuses on a single 
field: data science. Out of the vast content, we emphasize the information-theoretic 
concepts and tools related to data science applications. These applications include: 
community detection in social networks, DNA sequencing in biological networks, 
ranking in search engines, supervised learning, unsupervised learning and social AI.

Secondly, this book is written in a lecture-style format. Most books on this sub- 
ject cover numerous mathematical concepts and theories, as well as various appli- 
cations in diverse domains. The concepts and relevant theories are presented in a 
dictionary-style organization, with topics listed in a sequential order. Although this
dictionary-style organization makes it easy to find specific material, it often lacks a 
cohesive narrative that can engage and motivate readers. This book aims to engage 
and motivate those who are interested in data science and its interconnections with 
other disciplines. Our aim is to create a compelling narrative that emphasizes the 
significance of fundamentals in the field. To achieve this, we have adopted a lecture- 
style format, with each section serving as notes for a lecture lasting approximately 
80 minutes. A consistent connection is established across sections through themes 
and concepts. To ensure a smooth transition from one section to the next, we have 
included two paragraphs: (i) the “recap” paragraph that summarizes what has been 
covered and motivates the contents of the current section; and (ii) the “look ahead” 
paragraph that introduces the upcoming contents by linking them to previous material.

The final feature of this book is the inclusion of many programming exercises using
two software tools: (i) Python; and (ii) TensorFlow. While C++ and MATLAB
are widely used in traditional fields, Python has become a key language in data
science. Given the breadth of data science applications covered in the book, we
have selected Python as our primary platform. To implement machine learning
and deep learning algorithms, we utilize TensorFlow, one of the most popular deep
learning frameworks. TensorFlow provides built-in functions for many important
procedures in deep learning, and it is integrated with Keras, a high-level library
that emphasizes fast user experimentation. With Keras, we can easily
transition from an idea to implementation with minimal steps.


Structure of the book This book consists of course materials developed at 
KAIST over the past decade: (i) EE623 Information Theory (offered from Fall 2012 
to 2016 and in 2018 and in 2019); (ii) EE326 Introduction to Information The- 
ory and Coding (offered in Spring 2016 and 2017); (iii) EE321 Communication 
Engineering (offered from Spring 2013 to 2015 and in 2022); (iv) EE523 Convex 
Optimization (Spring 2019); and (v) EE424 Introduction to Optimization (Fall 
2020 and 2021). It is structured into three parts, each consisting of many sections. 
Each section covers the material from a single lecture, which lasted approximately 
80 minutes. Problem sets, which served as homework in the courses, are included 
every three or four sections. The detailed contents are summarized below.


I. Source coding (9 sections and 3 problem sets): A brief history of informa- 
tion theory; in-depth study of key notions and Python exercise (entropy, 
joint entropy, mutual information, Kullback-Leibler (KL) divergence); role 
of entropy in source coding theorem; prefix-free codes; Kraft’s inequality; 
typical sequences and the asymptotic equipartition property; entropy rate; 
Huffman code and Python implementation. 
II. Channel coding (9 sections and 3 problem sets): Role of mutual information in
channel coding theorem; capacity of binary erasure channels (BECs), binary 
symmetric channels (BSCs), and discrete memoryless channels (DMCs); 
random coding; maximum a posteriori probability (MAP) decoding, max- 
imum likelihood (ML) decoding; jointly typical sequences; union bound; 
Fano’s inequality; data processing inequality; polar code and Python imple- 
mentation. 

III. Data science applications (20 sections and 6 problem sets): Community
detection in social networks; the achievability and converse proofs of the 
fundamental limits; the spectral algorithm and Python implementation; 
Haplotype phasing in computational biology; the achievability and con- 
verse proofs of the fundamental limits; an advanced two-staged algorithm 
and Python implementation; top-K ranking in search engines; a variant of
PageRank, an advanced two-staged algorithm and their Python implemen- 
tation; supervised learning, and the role of cross entropy in logistic regres- 
sion and deep learning; gradient descent; TensorFlow implementation of 
a digit classifier; unsupervised learning, and the role of the KL divergence 
in Generative Adversarial Networks (GANs); alternating gradient descent; 
TensorFlow implementation of a digit image GAN; fair machine learn- 
ing, and the role of mutual information in the design of a fair classifier; 
TensorFlow implementation of a recidivism predictor. 


In terms of data science applications, several sections are adapted from the 
author’s previous books: (i) “Convex Optimization for Machine Learning” (Suh, 
2022); and (ii) “Communication Principles for Data Science” (Suh, 2023). The 
contents have been tailored to fit the theme of this book, which focuses on the role
of information-theoretic concepts and tools. The book also includes three appen- 
dices: two providing brief tutorials on the programming languages used (Python 
and TensorFlow); and one offering guidance on how to conduct research (primar- 
ily aimed at student readers). These tutorials have been adapted from (Suh, 2022, 
2023), with appropriate modifications to suit the focused topics. At the end of the 
book, a list of references relevant to the discussed content is provided, but these are 
not explained in detail, as we do not aim to exhaust the extensive research literature. 


How to use this book This book is written as a textbook for a senior-level under- 
graduate course and is also suitable for a first-year graduate course. The expected 
background includes solid undergraduate courses in probability and random pro- 
cesses, as well as basic familiarity with Python. 
For students and interested readers, we provide the following guidelines:


1. Study one section per day and two sections per week: This is recommended, as 
each section is designed for a single lecture and two lectures are typical per 
week in a course offering. 

2. Complete the content in Parts I and II: One of the most important concepts 
in information theory is phase transition, together with the achievability and
converse proofs. Also, the following three notions are crucial: entropy, mutual
information and the KL divergence. If you are already familiar with these, 
you can quickly review Parts I and II before proceeding to Part III. However, 
if you are not familiar with these concepts, it is recommended to read Parts I 
and II in a sequential manner. The sections are arranged in a way that builds 
a motivating storyline, and appropriate exercise problems are interspersed 
throughout to enhance your understanding and motivation. 

3. Explore Part III as per your interest: Part III focuses on applications and can be 
partially read, if desired. However, it is structured so that each section builds 
upon the previous one, assuming a sequential reading. One of the key aspects 
of Part III is the implementation of algorithms in Python and TensorFlow. 
With the guidance provided in the main text, problem sets, and appendices, 
you should be able to implement the algorithms covered. 

4. Solve four to five basic problems in each problem set: Over 130 problems 
(including more than 280 subproblems) are provided. Most of them elab- 
orate on the concepts discussed in the main text. The exercises cover 
basics in probability and random processes, relatively simple derivations of 
results from the main text, in-depth exploration of non-trivial concepts not 
fully explained in the main text, and implementation through Python or 
TensorFlow. The problems are closely tied to the established storyline, so it 
is essential to work on at least some of them to fully understand the material. 


In the course offerings at KAIST, we have covered most of the materials in Parts 
I and II, but only a limited number of applications in Part III. Based on the stu- 
dents’ backgrounds, interests, and available time, there are several ways to structure 
a course utilizing this book. For example: 


1. Semester-based course (24-26 lectures): This option would entail covering all
the sections in Parts I and II, as well as two to three selected applications from 
Part III, e.g., (i) community detection and supervised learning, (ii) Haplo- 
type phasing and GANs, or (iii) top-K ranking and fair machine learning. 

2. Quarter-long course (18-20 lectures): This option would encompass almost all
the materials in Parts I and II, excluding certain topics like Huffman coding, 
source-channel separation theorem, the role of feedback, and polar codes.
It would also investigate two applications selected from Part III.

3. A graduate-level course for students with prior knowledge of information theory: 
This option would provide a brief review of the contents in Parts I and II, 
taking approximately 6-8 lectures. The focus would be on covering as many 
of the materials in Part III as possible. 


Programming exercises can be included as homework assignments to enhance the 
in-class learning experience. 


DOI: 10.1561/9781638281153.ch1 


Chapter 1 


Source Coding 


1.1 Overview of the Book 


Outline In this section, we will cover two basic items. Firstly, we will discuss the
logistics of the book, providing details on its organization. Secondly, we will provide 
a brief overview of the book, including the history of how information theory was 
developed and what will be covered throughout the book. 


Prerequisite A basic understanding of probability and random processes is 
required before proceeding with this book. This can be achieved by taking 
introductory-level courses on the topics, typically offered in the Department of 
Electrical Engineering. If you have taken equivalent courses, this is also acceptable. 
The importance of probability in this book is rooted in the fact that information 
theory was developed in the context of communication, where the relationship 
between information theory and probability is evident. 

Communication is the transfer of information from one end (called the 
transmitter) to the other (called the receiver), over a physical medium (such as air)
between the two ends. The physical medium is called the channel. The channel 
links the concept of probability to communication. If you think about how the 
channel behaves, you can easily see why. The channel can be interpreted as a system
(in other words, a function) that inputs a transmitted signal and outputs a received 
signal. However, the channel is not a deterministic function, as it is subject to ran- 
dom elements, also known as noise, that are added to the system. Typically, the 
noise is additive, meaning that the received signal is the sum of the transmitted sig- 
nal and the noise. In mathematics or statistics, such a random quantity is referred 
to as a random variable or random process, which is based on probability. This 
is why a comprehensive understanding of probability is crucial for understanding 
this book. If you have taken a basic course on probability but are not familiar with 
random processes, don't be concerned. Whenever the topic of random processes
arises, we will provide detailed explanations and exercises to help you comprehend 
the material. 
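
The additive-noise description above can be made concrete with a small simulation. The following is only an illustrative sketch: the function name `additive_noise_channel`, the zero-mean Gaussian noise model, and the parameter `noise_std` are assumptions introduced here, not part of the text.

```python
import random


def additive_noise_channel(transmitted, noise_std, seed=None):
    """Return received = transmitted + noise, sample by sample.

    The noise samples are drawn independently from a zero-mean Gaussian.
    This noise model is an assumption made purely for illustration.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in transmitted]


# A short transmitted signal and one random realization of the received signal.
transmitted = [1.0, -1.0, 1.0, 1.0, -1.0]
received = additive_noise_channel(transmitted, noise_std=0.3, seed=0)

for x, y in zip(transmitted, received):
    print(f"sent {x:+.1f}, received {y:+.3f}")
```

Running the function twice with different seeds gives different received signals for the same input, which is exactly why the channel is modeled probabilistically rather than as a deterministic function.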

Another course that can aid in understanding the material in this book
is a course on random processes, e.g., EE528 at KAIST or EE226
at UC Berkeley. This is a graduate-level course that delves deeper into probability, 
encompassing many crucial concepts related to random processes. If you have the 
passion and time, we strongly recommend taking this course while reading this 
book, though it is not a prerequisite. 


Problem sets Problem sets are provided every three to four sections, with a 
total of 12 problem sets. We encourage working together with other peers if 
available, as problem sets serve as opportunities for learning, and any method 
that enhances your learning is encouraged, including discussion, teaching oth- 
ers, and learning from others. Solutions will be made available only to instructors 
upon request. Some problems may require the use of programming tools such as 
Python and TensorFlow. We will be using Jupyter notebook. For further infor- 
mation, please refer to the installation guide in Appendix A.1 or seek assistance 
from: 


https://jupyter.readthedocs.io/en/latest/install.html


We provide tutorials for the programming tools in appendices: (i) Appendix A for 
Python; and (ii) Appendix B for TensorFlow. 


History of communication We will explore the establishment and evolution 
of information theory within the field of communication. We will begin with a 
recount of the communication industry's role in the establishment of information 
theory. Afterwards, we will delve into the in-depth topics that we will cover through- 
out the book. 

Communication is the transfer of information from one end (the transmitter) to 
the other (the receiver). In between lies a physical medium, known as the channel. 
The history of communication dates back to the beginning of civilization, where 
people communicated through dialogue. However, this form of communication has
no relation to the electronic communication systems prevalent today. There was a 
major breakthrough in the history of communication with the invention of the 
telegraph by Samuel Morse (Beauchamp, 2001). Morse code¹ was the first instance
of a simple transmission system used in the telegraph. The invention was based on 
the discovery in physics that electrical signals, such as voltage or current signals, 
could be transmitted over wires, such as copper lines. This was the first communi- 
cation system to use electrical signals and is the reason that communication systems 
are studied in electrical engineering. Over time, this technology was improved and 
Alexander Graham Bell invented the telephone (Coe, 1995). 

Later advancements in communication systems were made based on another 
discovery in physics, that electrical signals could be transmitted wirelessly through 
electromagnetic waves, known as radio waves. This discovery inspired Guglielmo 
Marconi to develop a wireless version of the telegraph, known as wireless telegra- 
phy (Bondyopadhyay, 1995). Over time, this technology was further developed, 
leading to the invention of radio and television. 


The state of affairs in the early 20th century and Claude E. Shannon In 
the early 20th century, several communication systems emerged, such as telegraphs, 
telephones, wireless telegraphs, radios, and televisions. Claude E. Shannon, known 
as the father of information theory, made a noteworthy observation about these 
systems during this time. He pointed out that the engineering designs of these sys- 
tems were customized and specific to each application, resulting in varying design 
principles for different signals. 

Shannon was discontent with this ad-hoc approach and felt that a general frame- 
work was necessary to unify these different communication systems. With this in 
mind, he formulated three questions aimed at integrating the fragmented approach. 


Shannon’s questions The first question is the most fundamental in terms of 
the possibility of unification in communication systems. 


Question 1: Is there a general unified methodology for designing communication systems?


The second question is a logical follow-up to the first and is aimed at addressing
it. Shannon believed that if unification was possible, there could be a common
currency (such as the dollar in economics) with respect to information. In communication
systems, there are a variety of information sources, such as text, voice,
video, and images. The second question addresses the existence of such a common
currency.


1. The term “code” should not be mistaken for computer programming languages, such as C++ and Python,
as they have no relationship to it. In the communication literature, “code” refers to a transmission scheme.


Question 2: Is there a common currency of information that can represent different information sources?


The last question pertains to the communication process itself:


Question 3: Is there a limit on the speed of communication?


By addressing these questions, Shannon was able to develop a single theory, which 
would later be known as information theory. 


What Shannon did What Shannon did can be divided into three parts. Firstly, 
he demonstrated that the answer to the second question was affirmative, and he 
devised a common currency of information that could represent different types of 
information sources. With this common currency, he then addressed the first ques- 
tion and developed a single comprehensive framework that could unify all the vari- 
ous communication systems. Under this unified framework, he answered the third 
question by showing that there is a limit to the amount of information that can be 
communicated, expressed in terms of the common currency. He also characterized 
this limit. 

Interestingly, during the process of characterizing the limit, Shannon made an 
important observation. The limit is solely dependent on the channel, regardless 
of any transmission and reception strategy. This means that for a given channel, 
there is a fundamental limit to the amount of information that can be transmitted, 
and beyond this limit, communication becomes impossible, no matter what. This 
quantity does not change, regardless of the actions of the transmitter and receiver. 
It is like a fundamental law dictated by nature. Shannon theorized this law in a 
mathematical framework and referred to it as “a mathematical theory of communi- 
cation” in his landmark paper (Shannon, 2001). Later, this theory became known 
as information theory or the Shannon theory. 


A communication architecture Next, we will explain how Shannon accom- 
plished these tasks, and then we will outline the specific topics that will be covered 
in this book. 

To begin, we will introduce one more term alongside the three terms
of transmitter, receiver, and channel. This new term is called “information source,” 
and it refers to the information that one wishes to transmit, such as text, voice, or 
image pixels. According to Shannon, there must be a process that transforms the 
information source before it is transmitted. He envisioned this process as a black 
box, which he called an encoder. At the receiver, there must also be a process that
tries to recover the information source from the received signals, which Shannon
referred to as a decoder. This was the first block diagram that Shannon imagined for
a communication architecture, as shown in Fig. 1.1. From Shannon’s perspective,
a communication system is simply a collection of an encoder and a decoder, and
designing a communication system involves creating an appropriate pair of encoder
and decoder.


Figure 1.1. A basic communication architecture (block labels: information source, encoder at the transmitter, decoder at the receiver).


Representation of an information source With the basic architecture 
(Fig. 1.1) in mind, Shannon sought to unify the various communication systems 
that existed at the time. Many engineers at the time transmitted information sources 
without significant modification, despite the variations in the information sources 
based on the application. Shannon believed that this was the reason behind the 
existence of multiple communication systems. 

To achieve unification, Shannon believed that there had to be a common rep- 
resentation that could be used to describe different information sources. His work 
on his Master’s thesis at MIT (Shannon, 1938) was pivotal in finding a universal 
way of representing information sources. He used Boolean algebra to demonstrate 
in his thesis that logical relationships in circuit systems could be represented using 
binary strings, represented by the 0/1 logic. 

Encouraged by this, Shannon theorized that the same approach could be applied 
to communication systems, meaning that any type of information source could 
be represented using a binary string. He proved that this was indeed possible by 
demonstrating that binary strings, known as “bits,” could represent the meaning 
of information. For instance, in the case of an English text consisting of multiple 
letters, how could each letter be represented using a binary string? One key realiza- 
tion was that there is a finite number of possibilities for each letter. This number 
refers to the total number of letters in the English alphabet, which is 26, excluding 
any special characters like spaces. From this observation, it can be deduced that 
$\lceil \log_2 26 \rceil = 5$ bits suffice to represent each letter.
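
The counting argument above can be checked with a few lines of Python. This is a minimal sketch: the helper name `fixed_length_bits` and the particular assignment of letters to binary strings (alphabetical order) are assumptions chosen for illustration.

```python
import math
import string


def fixed_length_bits(alphabet_size):
    """Bits needed so that each of alphabet_size symbols gets a distinct
    binary string of the same length."""
    return math.ceil(math.log2(alphabet_size))


letters = string.ascii_lowercase          # the 26 English letters
width = fixed_length_bits(len(letters))   # ceil(log2(26))

# One possible fixed-length codebook: the index of each letter in binary.
codebook = {ch: format(i, f"0{width}b") for i, ch in enumerate(letters)}

print(width)                         # 5
print(codebook["a"], codebook["z"])  # 00000 11001
```

Since 5 bits give 32 distinct strings and only 26 are used, this fixed-length representation is valid but leaves some strings unassigned.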


A two-stage architecture This realization led Shannon to propose bits as a 
standard unit of information. He proposed a two-stage architecture where the 
encoder was divided into two parts. The first part, known as the “source encoder,” 
was responsible for converting the information source into bits. The second part, 
known as the “channel encoder,” was responsible for converting the bits into a signal
that could be transmitted over a channel. 

Similarly, the receiver operates in two stages, but in reverse order. The received 
signals are first converted into bits through the channel decoder, and then the information
source is reconstructed from the bits through the source decoder. The source
encoder should be a one-to-one mapping; otherwise the source decoder has no way to
recreate the original information source.

The portion of the system that extends from the channel encoder, through the 
channel, to the channel decoder is referred to as the “digital interface.” This digital 
interface is universal and agnostic to the type of information source, as the input to 
the digital interface is always bits, regardless of the source. In this sense, it provides 
a unified communication architecture. 
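
The two-stage architecture can be sketched end to end in Python. Everything concrete below is an assumption made for illustration and is not taken from the text: the source encoder uses 8 bits per character (valid for characters with code points below 256), the channel encoder is a simple 3-fold repetition code, and the channel flips each bit independently with probability `flip_prob`.

```python
import random


def source_encode(text):
    """Source encoder: map each character to 8 bits (assumes code points < 256)."""
    return [int(b) for ch in text for b in format(ord(ch), "08b")]


def source_decode(bits):
    """Source decoder: invert source_encode, reading 8 bits per character."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return "".join(chr(int("".join(map(str, c)), 2)) for c in chunks)


def channel_encode(bits, repeat=3):
    """Channel encoder (assumed repetition code): send each bit `repeat` times."""
    return [b for b in bits for _ in range(repeat)]


def channel(signal, flip_prob, rng):
    """Assumed channel model: flip each bit independently with prob. flip_prob."""
    return [b ^ int(rng.random() < flip_prob) for b in signal]


def channel_decode(received, repeat=3):
    """Channel decoder: majority vote over each group of `repeat` bits."""
    groups = [received[i:i + repeat] for i in range(0, len(received), repeat)]
    return [int(2 * sum(g) > repeat) for g in groups]


rng = random.Random(0)
message = "HELLO"
bits = source_encode(message)                  # information source -> bits
sent = channel_encode(bits)                    # bits -> transmitted signal
received = channel(sent, flip_prob=0.05, rng=rng)
recovered = source_decode(channel_decode(received))
print(message, "->", recovered)
```

Note that `channel_encode`, `channel`, and `channel_decode` only ever see bits, so the same three functions would work unchanged for any other source encoder. This is the sense in which the digital interface is agnostic to the information source.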


Two questions on the fundamental limits Keeping the two-stage architec- 
ture (Fig. 1.2) in his mind, Shannon tried to address the third question: Is there 
a limit on how fast one can communicate? Shannon discovered the importance of 
having bits as a standard unit of information. In his proposed two-stage architec- 
ture, the source encoder is responsible for converting the information source into 
bits, before the channel encoder converts the bits into a signal that can be trans- 
mitted over a channel. 

In order to maximize the amount of information transmitted, Shannon consid- 
ered the efficiency of the source encoder. To do this, he split his third question 
into two sub-questions: the first focused on finding the minimum number of bits 
needed to represent the information source, and the second focused on determining 
the maximum number of bits that can be transmitted over a channel successfully. 


Figure 1.2. A two-stage communication architecture (block labels: information source, source encoder and channel encoder at the transmitter; channel decoder and source decoder at the receiver).


Shannon developed two theorems to answer these sub-questions. The first, called 
the “source coding theorem,” determined the minimum number of bits needed to 
represent the information source. The second, called the “channel coding theorem,” 
characterized the maximum transmission capability. 


Source coding theorem Let’s first discuss the source coding theorem. An infor- 
mation source can be represented as a sequence of elementary components, such 
as a sequence of English letters, audio signals, or image pixels, represented as
$S_1, S_2, S_3, \ldots$. Shannon viewed this sequence as a random process, as the information
source is unknown to the receiver.

The minimum number of bits needed to represent the information source 
depends on the probabilistic properties of the random process, specifically its joint 
distribution. For example, when the random process has a simple joint distribution, 
such as when the individual variables are independent and identically distributed 
(i.i.d.), it is straightforward to state the source coding theorem. 

In this context, each individual random variable is referred to as a “symbol.” The 
minimum number of bits needed to represent the information source is related 
to the concept of entropy, which is a measure of disorder or randomness. This 
measure plays a crucial role in formulating the source coding theorem, which states 
that the minimum number of bits needed to represent the information source is 
proportional to the entropy of the random process. 


Theorem 1.1 (Source coding theorem in the i.i.d. case). The minimum number
of bits that can represent the source per symbol is the entropy of the random variable $S$,
denoted by

$$H(S) := \sum_{s \in \mathcal{S}} P_S(s) \log_2 \frac{1}{P_S(s)}, \tag{1.1}$$

where $P_S(s)$ denotes the probability distribution² of $S$, and $\mathcal{S}$ (that we call “calligraphy
S”) indicates the range (the set of all possible values that $S$ can take on).
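
The entropy formula in Theorem 1.1 translates directly into code. The following is a minimal sketch; the function name `entropy` and the dictionary representation of the pmf are assumptions made for illustration.

```python
import math


def entropy(pmf):
    """Entropy H(S) in bits, for a pmf given as {value: probability}.

    Terms with zero probability are skipped, following the usual
    convention that 0 * log(1/0) is taken to be 0.
    """
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)


fair_coin = {"heads": 0.5, "tails": 0.5}
uniform_letters = {chr(ord("a") + i): 1 / 26 for i in range(26)}

print(entropy(fair_coin))        # 1.0
print(entropy(uniform_letters))  # close to log2(26), roughly 4.7
```

For the uniform distribution over 26 letters the entropy is about 4.7 bits, slightly below the 5 bits used by a fixed-length representation of the same alphabet.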


Source code example To gain a clearer understanding of the source coding 
theorem, let's examine a practical example. The objective in source coding is to
establish a functional relationship, denoted by f, between the input sequence S and 
the output from the source encoder, referred to as the “codeword”. For instance, 
consider a DNA sequence where each symbol S can take on one of four values: 
A, C, T, G. To make things simple, let’s consider an unrealistic yet straightforward
scenario where the random process, represented by $\{S_i\}$, is independent and identically
distributed (i.i.d.). In this case, each symbol in the sequence is distributed as
follows:

$$S = \begin{cases} A, & \text{with probability (w.p.) } \frac{1}{2}; \\ C, & \text{w.p. } \frac{1}{4}; \\ T, & \text{w.p. } \frac{1}{8}; \\ G, & \text{w.p. } \frac{1}{8}. \end{cases}$$


2. It is a probability mass function, simply called pmf, for the case where S is a discrete random variable. It is
often denoted by p(s) for brevity.


How can we design the functional relationship f(S) in order to minimize the average
length of the codeword, represented by E[length(f(S))], and reduce the number of bits
needed to represent S? With a total of four letters, it is sufficient to use two bits per
symbol. A simple approach might be to assign A to 00, C to 01, T to 10, and G to 
11. This would result in an average of 2 bits per symbol. However, according to the 
source coding theorem, it is possible to attain better results. The limit promised is: 


$$H(S) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75. \tag{1.2}$$


The existence of a code that achieves the desired limit has been confirmed. The 
code is based on the following observations: (i) the letter A occurs more frequently 
than the other letters, and (ii) the length of the codeword does not have to be fixed. 
These observations lead to a natural idea: assigning short codewords to frequent
letters and long codewords to less frequent letters. To implement this idea, a “binary
code tree” is introduced to facilitate the mapping from S to f (S). 

A binary code tree: A binary code tree is a tree structure where every internal node 
has only two branches, and a node without any branches is referred to as a leaf. An 
example of this can be seen in Fig. 1.3. The binary tree is related to a code by 
assigning a symbol to a leaf and defining the functional relationship by specifying 
the pattern of the corresponding codeword. This is done by following the sequence 
of binary labels (associated with branches) from the root to the leaf. For instance, 
if an upper branch is labeled 0 and a lower branch is labeled 1, and the symbol A 
is assigned to the top leaf, then f(A) = 0, as there is only one branch (labeled 0) 
linking the root to the leaf. Similarly, f(C) = 10, as there are two branches (labeled 
1 and 0, respectively) connecting the root to the leaf assigned to C. 

How can we implement an optimal mapping rule that achieves 1.75 bits per symbol
using a binary code tree? As previously noted, the goal is to assign short codewords 
to frequent letters. The most frequent letter, A, should be assigned to the top leaf, 
as it has the shortest codeword length. This is evident. However, what about the 
second most frequent letter, C? It may seem like a good idea to assign it to the 
internal node marked with a blue square in Fig. 1.3, but this is not a valid solution. 
Figure 1.3. Representation of an optimal source code via a binary code tree (leaves: A (0), C (10), T (110), G (111)).


The reason is that assigning C to this node would make it a leaf, so the tree would
stop growing there. This is problematic, as there would then be only two leaves
available, but four are required in total. Another two branches need to be generated
from the internal node. C can then be assigned to the
second top leaf with a codeword length of 2. Similarly, the next frequent letter,
T (or G), cannot be assigned to the internal node marked with a red triangle in 
Fig. 1.3. The node should have another set of two branches. The remaining letters 
T and G are assigned to the two remaining leaves. With this mapping rule, we 
achieve: 


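$$\begin{aligned} \mathbb{E}[\text{length}(f(S))] &= P(S = A)\,\text{length}(f(A)) + P(S = C)\,\text{length}(f(C)) \\ &\quad + P(S = T)\,\text{length}(f(T)) + P(S = G)\,\text{length}(f(G)) \\ &= \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75 = H(S). \end{aligned}$$
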
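
The code-tree construction can be verified numerically. The sketch below assumes the symbol probabilities 1/2, 1/4, 1/8, 1/8 for A, C, T, G (consistent with the stated limit of 1.75 bits) and the codewords 0, 10, 110, 111; the codeword for T is inferred from the branch-labeling convention, and the helper names `encode` and `decode` are hypothetical.

```python
import math

# Assumed pmf and codewords read off the leaves of the binary code tree.
pmf = {"A": 1 / 2, "C": 1 / 4, "T": 1 / 8, "G": 1 / 8}
code = {"A": "0", "C": "10", "T": "110", "G": "111"}


def encode(symbols):
    """Concatenate the codeword of each symbol."""
    return "".join(code[s] for s in symbols)


def decode(bits):
    """Read bits left to right and emit a symbol whenever a complete
    codeword has been seen. This is unambiguous because every symbol sits
    at a leaf, so no codeword is a prefix of another."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


avg_len = sum(pmf[s] * len(code[s]) for s in pmf)
H = sum(p * math.log2(1 / p) for p in pmf.values())

print(avg_len, H)                            # 1.75 1.75
print(encode("ACGT"))                        # 010111110
print(decode(encode("ACGT")))                # ACGT
```

The decoder never needs a separator between codewords, which is what restricting symbols to leaves buys: variable-length codewords remain uniquely decodable.
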

Channel coding theorem The channel coding theorem states that the maximum 
number of bits that can be transmitted reliably over a channel is its capacity, 
represented by C. There is a mathematical definition for C, which involves several 
important concepts and notions that we will need to study. One of these important 
concepts is “mutual information.” We will delve into the definitions of these 
concepts later. 


Book outline The two theorems are at the core of the material covered in this 
book. The book is divided into three parts. In Part I, we will study the basic 
principles of the field, including (i) entropy, which is an essential component in 
establishing the source coding theorem; (ii) mutual information, which is critical for 
the channel coding theorem; and (iii) Kullback-Leibler (KL) divergence, a powerful 
concept that has similar functions to mutual information and is widely used in 
fields like mathematics, statistics, and machine learning. By using entropy, we will 
prove the source coding theorem. In Part II, we will prove the channel coding 
theorem with the use of mutual information and present a code that achieves 
the fundamental limit set by the channel coding theorem. 

Information theory, and these three crucial concepts in particular, form the foun- 
dation for solving numerous important problems across various fields such as com- 
munication, social networks, computational biology, machine learning, and deep 
learning. In Part III, we will examine the applications of information theory in the 
field of data science, with a specific emphasis on six major examples that highlight 
the fundamental limit and key concepts of information theory. 

The first of these applications is community detection (Girvan and Newman, 
2002; Fortunato, 2010; Abbe, 2017), a well-researched problem in data science 
with applications in social networks (such as Facebook, LinkedIn, and Twitter) 
and biological networks. The second is haplotype phasing, one of the significant 
DNA sequencing problems that shares a similar structure with community detec- 
tion (Browning and Browning, 2011; Das and Vikalo, 2015; Chen et al., 2016a; Si 
et al., 2014). The third is a ranking problem which forms the basis of search engines. 
Google’s PageRank (Page et al., 1999) is a famous ranking algorithm that serves as 
the backbone of Google’s website search engine. 

In these three problems, we will demonstrate that the concept of fundamental 
limits plays a crucial role in addressing the problems and in the development of 
optimal algorithms. The last three applications are related to machine learning and 
deep learning: (i) supervised learning, one of the most widely used machine learning 
methods; (ii) Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), a 
groundbreaking model for unsupervised learning; and (iii) fair classifiers (Larson et al., 
2016; Zafar et al., 2017; Cho et al., 2020), a timely and socially significant topic in 
machine learning. In particular, we will emphasize: (i) the central role of entropy 
and KL divergence in the design of a loss function for optimization in supervised 
learning; (ii) the fundamental role of KL divergence in the design of GANs; and (iii) 
the recently discovered role of mutual information in the design of fair classifiers. 


1.2 Entropy and Python Exercise 


Recap In the previous section, we recounted the story of Shannon’s founding 
of information theory. Motivated by the desire to unify the disparate communica- 
tion systems of the early 20th century, Shannon proposed the use of bits to rep- 
resent different information sources and established a unified two-stage architec- 
ture. The first stage transforms an information source into bits, while the second 
stage generates a signal that can be transmitted over a channel. With this frame- 
work in place, Shannon determined the limit on the amount of information that 
can be communicated, leading to the formulation of two fundamental theorems. 
The first of these theorems, the source coding theorem, defines the maximum 
compression rate of an information source, while the second theorem, the chan- 
nel coding theorem, defines the maximum number of bits that can be transmitted 
reliably. 


Outline Before delving into the two landmark theorems, we will first examine 
three crucial concepts that play central roles in the theorems: (i) entropy; (ii) 
mutual information; and (iii) Kullback-Leibler (KL) divergence. These concepts 
are essential in addressing many critical issues across a range of disciplines, includ- 
ing statistics, physics, computational biology, and machine learning. Therefore, it 
is advisable to familiarize oneself with the detailed properties of these concepts for 
various purposes. 

In this section, we will focus on the first concept: entropy. Our tasks will include: 
(i) reviewing the definition of entropy, providing intuitive explanations for the 
meaning of entropy to better understand why the maximum compression rate must 
be entropy, as stated in the source coding theorem; (ii) studying key properties of 
entropy that are useful in various contexts; and (iii) completing a Python exercise 
to compute entropy. In a later section, we will see how entropy factors into the 
proof of the source coding theorem. 


Definition of entropy The entropy is defined in relation to a random quantity. 
Specifically, it deals with the probability distribution of a random variable (for scalar 
random quantities) or a random process (for vector quantities). For simplicity, we 
will begin by examining the case of a random variable, and then move on to cover 
the general case of a random process later. 

More precisely, the entropy is defined w.r.t. a discrete³ random variable. Let X be 
a discrete random variable and $P_X(x)$ be its probability mass function (pmf). For 
brevity, we employ a simpler notation p(x) to indicate $P_X(x)$. Let $\mathcal{X}$ (that we 
call “calligraphic ex”) be its range: the set of values that X can take on. The 
entropy is defined as: 


$$
H(X) := \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{1}{p(x)} \quad \text{bits}. \tag{1.3}
$$


3. We say that a random variable is discrete if its range (the set of values that the random variable can take on) 
is finite or at most countably infinite. 

Throughout this book, the logarithmic function most commonly used is the base 
2 logarithm. However, to simplify the notation, we will omit the “2” and use the 
logarithm function without any base specification. 

We introduce an alternative expression for the entropy formula. It is much sim- 
pler and thus easy to remember. In addition, it serves to simplify the proof of some 
important properties that we are going to investigate. Observe in (1.3) that the 
entropy is a weighted sum of $\log \frac{1}{p(x)}$ over the different values of x. So it can be 
represented as: 


$$
H(X) := E\left[ \log \frac{1}{p(X)} \right] \tag{1.4}
$$


where the expectation is taken over the distribution p(x) of X. 
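One way to see that the expectation form agrees with the weighted-sum form is to estimate the expectation by a sample average. The following is a minimal sketch under assumptions of our own: the pmf `pmf` is an arbitrary illustrative choice, and the sample size and seed are picked only for reproducibility.

```python
import math
import random

# An illustrative pmf (an assumption made for this sketch).
pmf = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}

# Exact value from the weighted-sum formula (1.3).
H_exact = sum(p * math.log2(1 / p) for p in pmf.values())

# Sample-average estimate of E[log 1/p(X)] as in (1.4).
rng = random.Random(0)
samples = rng.choices(list(pmf), weights=list(pmf.values()), k=100_000)
H_est = sum(math.log2(1 / pmf[x]) for x in samples) / len(samples)

print(H_exact, H_est)
```

The two printed numbers should be close, with the gap shrinking as the number of samples grows.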


Interpretation #1 We will highlight two commonly recognized interpretations 
of entropy that are intuitive and can provide some understanding of why entropy 
is associated with the maximum compression rate. The first is: 


Entropy is a measure of the uncertainty of a random quantity. 


This interpretation is supported by a tangible instance, as demonstrated below. 
Consider two experiments: (i) tossing a fair coin; and (ii) rolling a fair die. One 
simple random variable that one can think of for the first experiment is a function 
that maps the head (or tail) event to 0 (or 1). Since the coin is fair, we have: 


$$
X = \begin{cases} 0, & \text{w.p. } \frac{1}{2}; \\ 1, & \text{w.p. } \frac{1}{2}. \end{cases}
$$


The abbreviation “w.p.” stands for “with probability”. On the other hand, a natural 
random variable in the second experiment is a function that maps a die result to 
the same number: 


$$
X = \begin{cases}
1, & \text{w.p. } \frac{1}{6}; \\
2, & \text{w.p. } \frac{1}{6}; \\
3, & \text{w.p. } \frac{1}{6}; \\
4, & \text{w.p. } \frac{1}{6}; \\
5, & \text{w.p. } \frac{1}{6}; \\
6, & \text{w.p. } \frac{1}{6}.
\end{cases}
$$


One may inquire about which random variable is more unpredictable. Intuitively, 
it appears that the second random variable is more uncertain. The entropy provides 
precise numerical evidence to confirm this intuition. Note that $H(X) = 1$ in the 
first experiment while $H(X) = \log 6 \approx 2.585 > 1$ in the latter. Entropy thus 
quantifies such uncertainty. 

Let us give another example. Suppose we have a bent coin. Then, it yields a 
different probability for the head event, say $p \neq \frac{1}{2}$: 


$$
X = \begin{cases} 0, & \text{w.p. } p; \\ 1, & \text{w.p. } 1 - p. \end{cases} \tag{1.5}
$$


Can it be inferred that this random variable is more unpredictable than the fair 
coin scenario? To examine this, let us contemplate an extreme situation in which 
$p \ll 1$, yielding the following: 


$$
H(X) = p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p} \approx 0. \tag{1.6}
$$


Here we used the fact that $\lim_{p \to 0^+} p \log \frac{1}{p} = 0$, which follows from 
L'Hôpital's rule that you may have learned in calculus (Stewart, 2015). Hence the 
bent-coin scenario is far more certain than the fair-coin scenario, whose entropy is 1. This is 
logically sound since a very small value of p ($p \ll 1$) implies that the tail event occurs 
almost all the time, making the outcome highly foreseeable. 
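The behavior of the bent-coin entropy for small p is easy to inspect numerically. The sketch below uses only the standard library; the helper name `binary_entropy` and the chosen values of p are illustrative assumptions, and the convention that a zero-probability outcome contributes 0 is applied explicitly.

```python
import math

def binary_entropy(p):
    """Entropy (bits) of X with P(X=0)=p and P(X=1)=1-p.

    Outcomes with probability 0 are skipped, i.e. 0*log(1/0) is taken as 0.
    """
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)

# Entropy shrinks toward 0 as the coin becomes more bent.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, round(binary_entropy(p), 4))

# Fair die, for comparison with the fair coin.
print(math.log2(6))
```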


Interpretation #2 The second interpretation concerns a method for eliminating 
uncertainty. To clarify this point, consider the following example: Imagine meet- 
ing a person for the first time. At this point, the person is entirely unknown to us. 
However, one can remove this uncertainty by asking questions. With each answer, 
randomness associated with the person can be eliminated. With enough questions, 
it is possible to gain complete knowledge about the individual. Therefore, the num- 
ber of questions needed to obtain comprehensive knowledge reflects the degree of 
uncertainty: the greater the number of questions required, the more unpredictable 
the situation. This leads to: 


Entropy is intimately related to the number of questions required to uncover the value of X. 


In some cases, the number of questions (on average) precisely corresponds to H(X). 
The following is an example of such a scenario: 


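$$
X = \begin{cases}
1, & \text{w.p. } \frac{1}{2}; \\
2, & \text{w.p. } \frac{1}{4}; \\
3, & \text{w.p. } \frac{1}{8}; \\
4, & \text{w.p. } \frac{1}{8}.
\end{cases}
$$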


A straightforward calculation yields H(X) = 1.75. Assume that the questions are 
binary, requiring a yes or no answer. In this case, the minimum average number of 
questions needed to ascertain the value of X is H(X). The optimal approach for 
posing questions to achieve H(X) is as follows: start by asking if X is equal to 1. If 
the answer is yes, then X is 1; otherwise, ask if X is equal to 2. If the answer is yes, 
then X is 2; if not, ask if X is equal to 3, which settles the value either way (X is 3 
if yes, and 4 if no). Let f(x) represent the number of questions 
required to determine X when X = x. This method results in: 


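$$
\begin{aligned}
E[f(X)] &= \frac{1}{2} f(1) + \frac{1}{4} f(2) + \frac{1}{8} f(3) + \frac{1}{8} f(4) \\
&= \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75.
\end{aligned}
$$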


Some of you may recognize that this is exactly the same as the number that appears 
in the prior source code example. See (1.2) for details. 
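The questioning strategy can be mimicked in code to confirm the average. The sketch below is illustrative: the function name `num_questions` and its `order` parameter are our own, and the last candidate value is assumed to need no question because it is the only one left.

```python
import math

# pmf of X from the example.
pmf = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}

def num_questions(x, order=(1, 2, 3, 4)):
    """Count yes/no questions "Is X equal to k?" asked in the given order.

    The last candidate needs no question: it is the only value left.
    """
    for asked, k in enumerate(order[:-1], start=1):
        if x == k:
            return asked
    return len(order) - 1

avg_questions = sum(p * num_questions(x) for x, p in pmf.items())
H = sum(p * math.log2(1 / p) for p in pmf.values())
print(avg_questions, H)

# Asking about the least likely values first costs more questions on average.
avg_reversed = sum(p * num_questions(x, (4, 3, 2, 1)) for x, p in pmf.items())
print(avg_reversed)
```

Asking about the most likely value first plays the same role as giving the shortest codeword to the most frequent letter.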


Key properties The entropy has several important properties. We can identify 
them by making some observations. 

Recall the bent-coin example. See (1.5) for the distribution of the associated ran- 
dom variable X. Consider the entropy calculated in (1.6). Notice that the entropy 
is a function of p. Fig. 1.4 illustrates how H (X) behaves as a function of p. One can 
make two observations here: (i) the minimum entropy is 0; and (ii) the entropy is 
maximized when $p = \frac{1}{2}$, i.e., X is uniformly distributed. 

Consider another example in which $X \in \mathcal{X} = \{1, 2, \ldots, M\}$ and is uniformly 
distributed: 


$$
X = \begin{cases}
1, & \text{w.p. } \frac{1}{M}; \\
2, & \text{w.p. } \frac{1}{M}; \\
\;\vdots \\
M, & \text{w.p. } \frac{1}{M}.
\end{cases}
$$


Figure 1.4. The entropy of a binary random variable X with P(X = 0) = p and P(X = 1) 
= 1 - p. 


In this case, 


$$
H(X) = \sum_{x=1}^{M} \frac{1}{M} \log M = \log M = \log |\mathcal{X}|
$$


where $|\mathcal{X}|$ indicates the cardinality of $\mathcal{X}$ (the size of the set). This leads to another 
observation: the entropy of the uniformly distributed random variable $X \in \mathcal{X}$ is 
$\log |\mathcal{X}|$. 


The above three observations lead us to conjecture the following two: for $X \in \mathcal{X}$, 


Property 1: $H(X) \geq 0$. 
Property 2: $H(X) \leq \log |\mathcal{X}|$. 

These properties hold indeed. The first is easy to prove. Using the definition of 
entropy and the fact that $p(X) \leq 1$, we get: $H(X) = E\left[\log \frac{1}{p(X)}\right] \geq E[\log 1] = 0$. 
Using Jensen's inequality, it is straightforward to prove the second property. This 
inequality is a popular and fundamental mathematical concept that underlies many 
results in information theory. Its formal statement is presented below. 


Theorem 1.2 (Jensen's inequality). For a concave⁴ function $f(\cdot)$, 


$$
E[f(X)] \leq f(E[X]).
$$


4. We say that a function f is concave if for any $(x_1, x_2)$ and $\lambda \in [0, 1]$, $\lambda f(x_1) + (1 - \lambda) f(x_2) \leq f(\lambda x_1 + (1 - \lambda) x_2)$. 
In other words, for a concave function, the weighted sum w.r.t. functions evaluated at two points is 
less than or equal to the function at the weighted sum of the two points. See Fig. 1.5. 


Figure 1.5. Jensen’s inequality for a concave function: $E[f(X)] \leq f(E[X])$. Horizontal axis marks: $x_1$, $E[X]$, $x_2$. 


Proof. The proof is immediate for a simple binary case, say $X \in \mathcal{X} = \{x_1, x_2\}$. 
By setting $p(x_1) = \lambda$ and $p(x_2) = 1 - \lambda$, we get: $E[X] = \lambda x_1 + (1 - \lambda) x_2$ and 
$E[f(X)] = \lambda f(x_1) + (1 - \lambda) f(x_2)$. The definition of concavity (see the associated 
footnote for the definition or Fig. 1.5) completes the proof. The generalization to 
an arbitrary $\mathcal{X}$ can be done by induction. Try this in Prob 1.5. 


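Using the definition of entropy and the fact that $\log(\cdot)$ is a concave function, 
we get: 


$$
\begin{aligned}
H(X) &= E\left[\log \frac{1}{p(X)}\right] \\
&\leq \log E\left[\frac{1}{p(X)}\right] \\
&= \log \left( \sum_{x \in \mathcal{X}} p(x) \frac{1}{p(x)} \right) \\
&= \log |\mathcal{X}|
\end{aligned}
$$


where the inequality is due to Jensen’s inequality. 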
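Both properties can be sanity-checked numerically on arbitrary distributions. The following is a rough sketch using only the standard library; `random_pmf`, the seed, and the alphabet sizes are assumptions made for illustration, and a numerical check of this kind is of course no substitute for the proof.

```python
import math
import random

def entropy(pmf):
    """Entropy in bits; zero-probability terms contribute 0."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

def random_pmf(m, rng):
    """Draw an arbitrary pmf on m points by normalizing positive weights."""
    w = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)  # fixed seed, for reproducibility only
for m in (2, 3, 6, 10):
    H = entropy(random_pmf(m, rng))
    # Property 1 and Property 2 (small tolerance for rounding).
    assert 0 <= H <= math.log2(m) + 1e-12
    print(m, round(H, 4), round(math.log2(m), 4))

# The uniform pmf attains the upper bound log|X|.
print(entropy([1/6] * 6), math.log2(6))
```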


Joint entropy We will examine the entropy defined w.r.t. multiple (say two) 
random variables. Since this calculation involves multiple quantities, it is referred 
to as joint entropy, and is defined as follows: for two discrete random variables, 
$X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, 


$$
H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) \log \frac{1}{P_{X,Y}(x, y)}
$$


where $P_{X,Y}(x, y)$ denotes the joint distribution of (X, Y). Again, we employ a simpler 
notation p(x, y). The only distinction w.r.t. the single random variable case 
is that the joint distribution p(x, y) comes into the picture. Similarly, an alternative 
expression reads: 


$$
H(X, Y) = E\left[\log \frac{1}{p(X, Y)}\right]
$$


where the expectation is taken over p(x, y). 


Chain rule We highlight a significant characteristic of joint entropy, known as the 
chain rule, which describes the relationship between multiple random variables. 
For the two random variable case, it reads: 


Property 3 (chain rule): $H(X, Y) = H(X) + H(Y|X)$ 


where H(Y|X) is conditional entropy and defined as: 


$$
H(Y|X) := E\left[\log \frac{1}{p(Y|X)}\right]
$$

where the expectation is taken over p(x, y) and p(y|x) denotes a simpler notation of 
conditional distribution $P_{Y|X}(y|x)$. The proof of the chain rule is straightforward: 


$$
\begin{aligned}
H(X, Y) &= E\left[\log \frac{1}{p(X, Y)}\right] \\
&\overset{(a)}{=} E\left[\log \frac{1}{p(X)\,p(Y|X)}\right] \\
&\overset{(b)}{=} E\left[\log \frac{1}{p(X)}\right] + E\left[\log \frac{1}{p(Y|X)}\right] \\
&\overset{(c)}{=} E_X\left[\log \frac{1}{p(X)}\right] + E\left[\log \frac{1}{p(Y|X)}\right] \\
&= H(X) + H(Y|X)
\end{aligned}
$$


where (a) follows from the definition of conditional probability ($p(y|x) := \frac{p(x, y)}{p(x)}$); 
(b) follows from the linearity of expectation; and (c) follows from $\sum_{y \in \mathcal{Y}} p(x, y) = p(x)$ 
(total probability law). The last step is due to the definition of entropy and 
conditional entropy. 

We provide an interesting interpretation on the chain rule. Remember that 
entropy is a measure of uncertainty. Hence, one can interpret Property 3 as fol- 
lows. The uncertainty of (X, Y) (reflected in H(X, Y)) is the sum of the following 
two: (i) the uncertainty of X (reflected in H(X)); and (ii) residual uncertainty in 
Y when X is known (reflected in H(Y|X)). See Fig. 1.6 for visual illustration. 


Figure 1.6. A Venn diagram interpretation of the chain rule. Region labels: H(X, Y) (entire area), H(X), H(Y|X). 


The interpretation of Property 3 becomes clearer when we map the Venn diagram 
to the amount of uncertainty associated with a random variable. The area of the 
blue circle represents H(X); the area of the red part represents H (Y |X); and the 
entire area represents H(X, Y). 


A remark on conditional entropy Another way to express conditional 
entropy is: 


$$
\begin{aligned}
H(Y|X) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(y|x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)} \\
&= \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)
\end{aligned}
$$


where the last equality is due to the conventional definition of: 


$$
H(Y|X = x) := \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)}.
$$


Here H(Y|X = x) is the entropy defined w.r.t. Y when X = x. So the conditional 
entropy can be interpreted as the weighted sum of H(Y|X = x). This interpretation 
helps in remembering the formula and also provides an easy way to compute it. 


Python exercise Let us investigate how to compute entropy, joint entropy and 
conditional entropy via Python. As per the definition (1.3) of entropy, one can 
calculate entropy from scratch via the following code. 

import numpy as np


def entropy(pX):
    # Entropy in bits, following definition (1.3)
    return sum(pX*np.log2(1/pX))


# probability distribution of a binary random variable
pX = np.array([1/2, 1/2])
print(entropy(pX))

1.0 


Alternatively one can use a built-in function in scipy.stats. 


from scipy.stats import entropy


pX1 = np.array([1/2, 1/2])  # numpy.array
pX2 = [1/2, 1/2]            # list
print(entropy(pX1, base=2))
print(entropy(pX2, base=2))

1.0 

1.0 


As for the input distribution, entropy can take either numpy.array or list. 
Using the above entropy function, we draw the entropy of a binary random 
variable as a function of p, as demonstrated in Fig. 1.4. 


import matplotlib.pyplot as plt


# Sweep p over (0, 1) and evaluate the binary entropy at each point
p = np.arange(0.001,0.999,0.001)
Hp = np.zeros(len(p))
for i,val in enumerate(p):
    pX = np.array([val, 1-val])
    Hp[i] = entropy(pX, base=2)


plt.figure(figsize=(5,5), dpi=150)
plt.plot(p,Hp)
plt.xlabel('p')
plt.ylabel('H(X)')
plt.title('Entropy of a binary random variable')
plt.show()


Figure 1.7. Python plotting: Entropy of a binary random variable. Horizontal axis: p; vertical axis: H(X). 


The way to compute joint entropy is the same as that of entropy. The only 
distinction is that the cardinality of the range set grows. To see this, consider a 
simple two-random variable example where the joint distribution reads: p(x, y) = 
1/4, 1/4, 1/3, 1/6 for (x, y) = (0, 0), (0, 1), (1, 0), (1, 1), respectively. The joint 
distribution is then represented as an array like [1/4, 1/4, 1/3, 1/6]. This gives the 
computation of joint entropy as: 


pXY = np.array([1/4, 1/4, 1/3, 1/6])
HXY = entropy(pXY, base=2)
print(HXY)


1.959147917027245 


Next, we compute H(X) and H(Y|X) (conditional entropy) to verify the chain 
rule. 

# Compute p(x)
pX = np.array([1/4+1/4, 1/3+1/6])
# Compute p(y|0)
pY_x0 = np.array([1/4, 1/4])/pX[0]
# Compute p(y|1)
pY_x1 = np.array([1/3, 1/6])/pX[1]

# Compute H(X)
HX = entropy(pX, base=2)
# Compute H(Y|X) = \sum_x p(x)*H(Y|X=x)
HY_X = pX[0]*entropy(pY_x0, base=2) \
     + pX[1]*entropy(pY_x1, base=2)

# Verify the chain rule: H(X,Y) = H(X) + H(Y|X)
print(HX+HY_X)


1.9591479170272448 


Notice that H(X) + H(Y|X) is the same as H(X, Y), both being ≈ 1.9591 (the last digit differs only due to floating-point rounding). 
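The same check can be written for a joint distribution stored as a table, which makes it easy to verify the chain rule in both orders. The sketch below is a standard-library variant of the exercise; storing the joint pmf as a nested list `pXY[x][y]` and the helper name `H` are our own choices for illustration.

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as an iterable of probabilities."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

# Joint pmf as a table: rows are x in {0, 1}, columns are y in {0, 1}.
pXY = [[1/4, 1/4],
       [1/3, 1/6]]

pX = [sum(row) for row in pXY]            # marginal of X
pY = [sum(col) for col in zip(*pXY)]      # marginal of Y
HXY = H(p for row in pXY for p in row)    # joint entropy

# H(Y|X) = sum_x p(x) H(Y|X=x), and likewise H(X|Y).
HY_X = sum(px * H(p / px for p in row) for px, row in zip(pX, pXY))
HX_Y = sum(py * H(p / py for p in col) for py, col in zip(pY, zip(*pXY)))

print(HXY, H(pX) + HY_X, H(pY) + HX_Y)
```

All three printed values should agree up to rounding, reflecting that the chain rule holds regardless of which variable is conditioned on.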

Look ahead We can use the entropy concept to prove the source coding theorem, 
but not the channel coding theorem. To prove the channel coding theorem, we need 
to introduce another concept: mutual information. In the next section, we will delve 
into mutual information, covering its definition and several important properties. 
These properties not only serve to prove the theorem but also play critical roles 
in other fields. Additionally, we will examine another essential concept: the KL 
divergence. The KL divergence, together with entropy, is crucial in proving the 
source coding theorem. It has also played a significant role in other disciplines, 
such as serving as a distance measure between distributions in statistics. 


1.3 Mutual Information, KL Divergence and Python Exercise 


Recap In the preceding section, we gained knowledge about entropy, which is 
crucial in proving the source coding theorem. Initially, we defined entropy for a 
single random variable and subsequently extended it to cases where multiple ran- 
dom variables are present. Furthermore, we explored the chain rule, a significant 
principle that governs the association among multiple random variables. Specifi- 
cally, for two random variables X and Y, the chain rule can be stated as follows: 


H(X, Y) = H(X) + H(Y|X) (1.7) 


where H(Y|X) denotes conditional entropy. Remember the definition of condi- 
tional entropy: 


$$
H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) \tag{1.8}
$$

where H(Y|X = x) is the entropy w.r.t. p(y|x). 

Towards the end, it was highlighted that to prove the channel coding theorem, 
an understanding of another significant concept, mutual information, is necessary. 
Additionally, we emphasized the need to delve into another important notion, the 
Kullback-Leibler (KL) divergence. 


Outline This section is dedicated to exploring two important concepts: mutual 
information and the KL divergence. It is divided into five parts. Firstly, we will 
begin with the definition of mutual information. Secondly, we will delve into the 
key properties of mutual information. Thirdly, we will examine the relationship 
between mutual information and the KL divergence. In the fourth part, we will dis- 
cuss how mutual information is related to the channel coding theorem. Finally, we 
will conclude this section with a Python exercise that involves computing mutual 
information and the KL divergence. 


Observation An interesting observation we made w.r.t. the chain rule (the Venn 
diagram interpretation in Fig. 1.8) brings about a natural definition for mutual 
information. First recall the interpretation. The randomness of two random vari- 
ables X and Y (reflected in the total area of two Venn diagrams) is the sum of 
the randomness of one variable, say X, (reflected in the area of the blue Venn dia- 
gram) and the uncertainty that remains about Y conditioned on X (reflected in the 
crescent-moon-shaped red area). By the chain rule, the crescent-moon-shaped red 
area can be represented as: H(Y|X). 

We see an overlap between the blue and red areas. The area of the overlapped 
part depends on how large H(Y|X) is: the larger the overlapped area, the smaller 
H(Y|X). A low value of H(Y|X) implies a high level of dependence between X 
and Y. Therefore, the larger the area of overlap between the two Venn diagrams, the 
greater the dependence between them. Consequently, the overlapping area quantifies 
the degree of shared information between X and Y. 


Figure 1.8. A Venn diagram interpretation of the chain rule. The total area corresponds to H(X, Y). 


Definition of mutual information This observation leads to the definition of 
the overlapped area that captures the shared information: 


I(X;Y) := H(Y) — H(Y|X). (1.9) 


In the literature, this notion is called mutual information instead of shared (or 
common) information. 

From the picture in Fig. 1.8, one can define it instead as $I(X;Y) := H(X) - H(X|Y)$ 
because the alternative indicates the same overlapped area. By convention, 
we follow the definition of (1.9) though: the entropy of the right-hand-side term 
inside $I(\cdot\,;\cdot)$ minus the conditional entropy of the right-hand-side term conditioned 
on the left-hand-side term. 


Key properties Remember that entropy respects: (i) the non-negativity 
$H(X) \geq 0$; and (ii) the cardinality bound $H(X) \leq \log |\mathcal{X}|$. Similarly, mutual 
information exhibits the following properties: 

Property 1: $I(X;Y) = I(Y;X)$; 

Property 2: $I(X;Y) \geq 0$; 

Property 3: $I(X;Y) = 0 \iff X \perp\!\!\!\perp Y$. 


The first property (named the symmetry property) is obvious from the picture. For 
rigor, we give the proof below: 


$$
\begin{aligned}
I(X;Y) &:= H(Y) - H(Y|X) \\
&\overset{(a)}{=} H(Y) - \left( H(X, Y) - H(X) \right) \\
&\overset{(b)}{=} H(Y) + H(X) - \left( H(Y) + H(X|Y) \right) \\
&= H(X) - H(X|Y) \\
&\overset{(c)}{=} I(Y;X)
\end{aligned}
$$


where (a) and (b) follow from the chain rule (1.7); and (c) is due to the definition 
of mutual information (1.9). 

The second property is also intuitive, as mutual information captures the 
overlapped area and therefore it must be non-negative. But the proof is not that 
simple. It requires several steps as well as the use of an important inequality 
that we learned in the previous section: Jensen’s inequality. We will prove 
the second property in the sequel. 

The third property also makes intuitive sense. Mutual information being 0 
means no dependence between X and Y, that is, independence between the 
two. But the proof is not trivial either. We will provide the proof right after proving 
the second property. 


Proof of $I(X;Y) \geq 0$ & its implication Starting with the definition of mutual 
information, we obtain: 


$$
\begin{aligned}
I(X;Y) &:= H(Y) - H(Y|X) \\
&\overset{(a)}{=} E_{Y}\left[\log \frac{1}{p(Y)}\right] - E_{X,Y}\left[\log \frac{1}{p(Y|X)}\right] \\
&\overset{(b)}{=} E_{X,Y}\left[\log \frac{1}{p(Y)}\right] - E_{X,Y}\left[\log \frac{1}{p(Y|X)}\right] \\
&\overset{(c)}{=} E_{X,Y}\left[\log \frac{p(Y|X)}{p(Y)}\right] \\
&\overset{(d)}{=} E_{X,Y}\left[-\log \frac{p(Y)}{p(Y|X)}\right] \\
&\overset{(e)}{\geq} -\log E_{X,Y}\left[\frac{p(Y)}{p(Y|X)}\right] \\
&= -\log \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \frac{p(y)}{p(y|x)} \right) \\
&\overset{(f)}{=} -\log \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x) p(y) \right) \\
&\overset{(g)}{=} -\log 1 = 0
\end{aligned}
$$


where (a) follows from the definition of entropy and conditional entropy; (b) is due to the 
total probability law $\sum_{x \in \mathcal{X}} p(x, y) = p(y)$; (c) is due to the linearity of 
expectation; (d) comes from $\log x = -\log \frac{1}{x}$; (e) is due to the fact that $-\log(\cdot)$ is a convex 
function and applying Jensen’s inequality; (f) follows from the definition of 
conditional distribution $p(y|x) := \frac{p(x, y)}{p(x)}$; and (g) is due to an axiom of the probability 
distribution: $\sum_{x \in \mathcal{X}} p(x) = \sum_{y \in \mathcal{Y}} p(y) = 1$. 

This non-negativity property has another intuitive implication. Applying the 
definition of mutual information and then re-arranging the two terms H(Y) and 
H(Y|X) properly, we get: 


$$
H(Y) \geq H(Y|X). \tag{1.10}
$$


Remember one interpretation of entropy: a measure of uncertainty. So H(Y) can 
be viewed as the uncertainty of Y, while H(Y|X) is interpreted as the residual 
uncertainty in Y after X is revealed. Our intuition then says: given side 
information like X that is given as conditioning, we know more about Y (the uncertainty 
is removed further) and therefore, such conditional entropy must be reduced. In 
short, conditioning reduces entropy. The above property proves this intuition. 
Some curious readers may want to ask: What if X is realized as a certain value 
X = x? In such a case, does the particular form of conditioning still reduce entropy: 


$$
H(Y) \geq H(Y|X = x)\,? \tag{1.11}
$$


Please think about it while solving Prob 1.9. 


Proof of $I(X;Y) = 0 \iff X \perp\!\!\!\perp Y$ To prove this, first recall one procedure that 
we had in the process of proving the second property: 


$$
\begin{aligned}
I(X;Y) &:= H(Y) - H(Y|X) \\
&= E\left[-\log \frac{p(Y)}{p(Y|X)}\right] \\
&\geq -\log E\left[\frac{p(Y)}{p(Y|X)}\right].
\end{aligned}
$$


Remember that the last inequality is due to Jensen’s inequality. As you will figure 
out while solving Prob 1.5, the sufficient and necessary condition for the equality 
to hold in the above is: 


$$
\frac{p(Y)}{p(Y|X)} = c \quad \text{(constant)}.
$$


The condition then implies that 


$$
p(y) = c\, p(y|x) \quad \forall x \in \mathcal{X}, \; y \in \mathcal{Y}.
$$


Using the axiom of probability distribution (the sum of the probabilities being 1), 
we get c = 1 and therefore: 


$$
p(y) = p(y|x) \quad \forall x \in \mathcal{X}, \; y \in \mathcal{Y}.
$$


Due to the definition of independence between two random variables, the above 
implies that X and Y are independent. Hence, this completes the proof: 


$$
I(X;Y) = 0 \iff X \perp\!\!\!\perp Y.
$$
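The three properties of mutual information can be observed numerically. The sketch below is only an illustration under our own assumptions: the joint pmf is stored as a nested list `joint[x][y]`, the random joint distribution and the two marginals `pX`, `pY` are arbitrary choices, and the helper names are hypothetical.

```python
import math
import random

def H(pmf):
    """Entropy in bits; zero-probability terms contribute 0."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

def cond_entropy(joint):
    """H(B|A) for joint[a][b] = p(a, b), computed as sum_a p(a) H(B|A=a)."""
    total = 0.0
    for row in joint:
        pa = sum(row)
        if pa > 0:
            total += pa * H(p / pa for p in row)
    return total

def mutual_information(joint):
    """I(X;Y) := H(Y) - H(Y|X) for joint[x][y] = p(x, y)."""
    pY = [sum(col) for col in zip(*joint)]
    return H(pY) - cond_entropy(joint)

def transpose(joint):
    """Swap the roles of X and Y."""
    return [list(col) for col in zip(*joint)]

# An arbitrary joint pmf on a 2 x 3 alphabet (fixed seed for reproducibility).
rng = random.Random(1)
w = [[rng.random() for _ in range(3)] for _ in range(2)]
s = sum(map(sum, w))
joint = [[v / s for v in row] for row in w]

I_xy = mutual_information(joint)
I_yx = mutual_information(transpose(joint))
print(I_xy, I_yx)  # symmetric and non-negative

# A product distribution p(x)p(y) gives zero mutual information.
pX, pY = [0.3, 0.7], [0.2, 0.5, 0.3]
indep = [[a * b for b in pY] for a in pX]
print(mutual_information(indep))
```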


Interpretation on $I(X;Y)$ Let us say a few words about $I(X;Y)$. Using the chain 
rule and the definitions of entropy and joint entropy, one can rewrite $I(X;Y) := 
H(Y) - H(Y|X)$ as 


$$
\begin{aligned}
I(X;Y) &= H(Y) + H(X) - H(X, Y) \\
&= E\left[\log \frac{1}{p(Y)}\right] + E\left[\log \frac{1}{p(X)}\right] - E\left[\log \frac{1}{p(X, Y)}\right] \\
&= E\left[\log \frac{p(X, Y)}{p(X)p(Y)}\right]. \qquad (1.12)
\end{aligned}
$$


This leads to the following observation: 


p(X, Y) close to p(X)p(Y) $\Longrightarrow$ $I(X;Y) \approx 0$; 
p(X, Y) far from p(X)p(Y) $\Longrightarrow$ $I(X;Y)$ far above 0. 


This enables us to interpret mutual information as a sort of distance measure that 
captures how far the joint distribution p(X, Y) is from the product distribution 
p(X)p(Y). In statistics, there is a well-known divergence measure that reflects a 
distance between two distributions: the KL divergence. So mutual information 
can be represented as the KL divergence. Before detailing the representation, let us 
first introduce the definition of the KL divergence.⁵ 


5. The classic book “Elements of Information Theory” by Cover and Thomas (Cover, 1999) employs a different 
name for the KL divergence: relative entropy. This name is popular, yet mainly in the information 
theory literature. In other communities, the name KL divergence is more prevalent. 
Hence, we have chosen the name KL divergence in this book. 


Definition of the KL divergence Let $Z \in \mathcal{Z}$ be a discrete random variable. 
Consider two probability distributions w.r.t. Z: p(z) and q(z) where $z \in \mathcal{Z}$. The 
KL divergence between the two distributions is defined as: 


$$
\mathsf{KL}(p \| q) := \sum_{z \in \mathcal{Z}} p(z) \log \frac{p(z)}{q(z)} = E_{p(Z)}\left[\log \frac{p(Z)}{q(Z)}\right]. \tag{1.13}
$$


Mutual information in terms of the KL divergence Applying the 
definition (1.13) to (1.12), we obtain: 


$$
\begin{aligned}
I(X;Y) &= E\left[\log \frac{p(X, Y)}{p(X)p(Y)}\right] \\
&= E_{p(X,Y)}\left[\log \frac{p(X, Y)}{p(X)p(Y)}\right] \\
&\overset{(a)}{=} E_{p(Z)}\left[\log \frac{p(Z)}{q(Z)}\right] \qquad (1.14) \\
&\overset{(b)}{=} \mathsf{KL}(p(Z) \| q(Z)) \\
&= \mathsf{KL}(p(X, Y) \| p(X)p(Y))
\end{aligned}
$$


where (a) comes from our own definition: Z := (X, Y), with q(z) := p(x)p(y) (note that p(x)p(y) is a 
valid probability distribution. Why?); and (b) is because of the definition of the KL 
divergence. 
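The identity between mutual information and the KL divergence can be checked on a small example. The sketch below reuses the joint distribution [1/4, 1/4, 1/3, 1/6] from the Python exercise of the previous section; the helper names `kl_divergence` and `H` are illustrative, and the function assumes q is positive wherever p is positive.

```python
import math

def kl_divergence(p, q):
    """KL(p||q) in bits for pmfs given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] = 0 contribute 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(pmf):
    """Entropy in bits."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

# Joint pmf p(x, y), flattened in the order (0,0), (0,1), (1,0), (1,1).
pXY = [1/4, 1/4, 1/3, 1/6]
pX = [pXY[0] + pXY[1], pXY[2] + pXY[3]]
pY = [pXY[0] + pXY[2], pXY[1] + pXY[3]]

# Product distribution p(x)p(y), flattened in the same order.
prod = [px * py for px in pX for py in pY]

I_kl = kl_divergence(pXY, prod)
I_entropy = H(pX) + H(pY) - H(pXY)
print(I_kl, I_entropy)
```

The two numbers should coincide up to rounding: one is computed from the KL form, the other from entropies.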


Properties of the KL divergence Just as mutual information has three 
properties, the KL divergence has three similar properties: 


Property 1: $\mathsf{KL}(p \| q) \neq \mathsf{KL}(q \| p)$ in general; 
Property 2: $\mathsf{KL}(p \| q) \geq 0$; 
Property 3: $\mathsf{KL}(p \| q) = 0 \iff p = q$. 


The first property of the KL divergence is different from mutual information in 
that it is not symmetric. The definition of the KL divergence in (1.13) only takes 
the expectation over the first probability distribution p, which breaks symmetry. 
The second and third properties, on the other hand, are similar to those of mutual 
information and their proofs are also similar. Please refer to Prob 1.12 for more 
details. 
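A two-point example makes the lack of symmetry concrete. The sketch below is illustrative only: the distributions `p` and `q` are arbitrary choices, and `kl_divergence` assumes the second argument is positive wherever the first is.

```python
import math

def kl_divergence(p, q):
    """KL(p||q) in bits; terms with p[i] = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/2]
q = [1/4, 3/4]

d_pq = kl_divergence(p, q)   # expectation taken over p
d_qp = kl_divergence(q, p)   # expectation taken over q
print(d_pq, d_qp, kl_divergence(p, p))
```

The first two values differ because each takes the expectation over a different distribution, while the divergence of a distribution from itself is zero.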

Figure 1.9. Binary erasure channel. Panels: (a) binary erasure channel (BEC); (b) erasure probability p. 


Connection between mutual information and channel capacity We 
establish a connection between mutual information and the channel coding theo- 
rem. To illustrate this connection, we consider a concrete exemplary channel called 
the binary erasure channel (BEC). The BEC is the first toy-example channel that 
Shannon proposed. It has an input, X, which is binary and takes values 0 or 1. The 
output, Y, is ternary and takes values 0, 1, or an erasure symbol (denoted by e). As 
discussed in Section 1.1, the channel introduces uncertainty in the form of noise, 
which can be characterized by the conditional distribution p(y|x). In the BEC, the 
output is the same as the input with probability 1 - p; otherwise, it takes an erasure 
symbol regardless of the value of x. The conditional distribution is given by: 


$$
p(y|x) = \begin{cases}
1 - p, & \text{for } (x, y) = (0, 0); \\
p, & \text{for } (x, y) = (0, e); \\
p, & \text{for } (x, y) = (1, e); \\
1 - p, & \text{for } (x, y) = (1, 1); \\
0, & \text{otherwise}.
\end{cases} \tag{1.15}
$$


A pictorial description of the BEC is in Fig. 1.9. A value placed above each arrow 
indicates the transition probability for a transition reflected by the arrow. 

To see the connection, let us compute the mutual information between the input
and the output:

I(X;Y) = H(Y) - H(Y|X).

As you can see, it requires a computation of the entropy H(Y) of a ternary random
variable Y. It turns out that H(Y) is a bit complicated to compute, while a simpler
calculation comes from an alternative expression:

I(X;Y) = H(X) - H(X|Y). (1.16)

In order to compute H(X), we need to know p(x). However, p(x) is not
given. So let us make a simple assumption: X is uniformly distributed, i.e.,
X ~ Bern(1/2). Here “~” means “is distributed according to”; Bern
denotes the distribution of a binary (or Bernoulli⁶) random variable; and the value
inside Bern(·) indicates the probability that the variable takes 1, simply called the
Bernoulli parameter. Assuming X ~ Bern(1/2), the entropy of X is H(X) = 1 and
the conditional entropy H(X|Y) is calculated as:


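$$\begin{aligned} H(X|Y) &\overset{(a)}{=} P(Y = e)H(X|Y = e) + P(Y \neq e)H(X|Y \neq e) \\ &\overset{(b)}{=} P(Y = e)H(X|Y = e) \\ &\overset{(c)}{=} p \end{aligned}$$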


where (a) is due to the definition of conditional entropy; (b) follows from the fact
that Y ≠ e completely determines X (no randomness) and therefore H(X|Y ≠ e) = 0;
and (c) follows from the fact that Y = e does not provide any information
about X, so X given Y = e has the same distribution as X and H(X|Y = e) =
H(X) = 1, together with P(Y = e) = p. Applying this to (1.16), we get:


I(X;Y) = H(X) - H(X|Y) = 1 - p.


This is where we can see the connection between mutual information and channel
capacity C: the maximum number of bits that can be reliably transmitted per channel use.
It turns out that I(X;Y) = 1 - p is the capacity of the BEC. Remember that we
assumed a particular distribution of X in computing I(X;Y). For a general channel specified
by an arbitrary p(y|x), the input distribution p(x) serves as an optimization variable and the
channel capacity is characterized as:


$$C = \max_{p(x)} I(X;Y). \tag{1.17}$$

This is the statement of the channel coding theorem. We see that mutual information
indeed characterizes the channel capacity. Later in Part II, we will prove the
theorem.
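As a numerical illustration of (1.17) for the BEC, the following minimal sketch evaluates I(X;Y) through the relation I(X;Y) = KL(p(x,y)||p(x)p(y)) and searches over Bernoulli input distributions on a grid. The helper names `mutual_information` and `bec`, the choice p = 0.3, and the grid of input parameters are illustrative assumptions, not part of the text.

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) in bits for an input pmf p_x and a channel matrix p_y_given_x[x, y]."""
    p_x = np.asarray(p_x, dtype=float)
    W = np.asarray(p_y_given_x, dtype=float)
    p_xy = p_x[:, None] * W              # joint distribution p(x, y)
    p_y = p_xy.sum(axis=0)               # output marginal p(y)
    prod = p_x[:, None] * p_y[None, :]   # product of marginals p(x)p(y)
    mask = p_xy > 0                      # convention: 0 log 0 = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / prod[mask])))

def bec(p):
    """Transition matrix of the BEC; columns correspond to y = 0, e, 1."""
    return np.array([[1 - p, p, 0.0],
                     [0.0, p, 1 - p]])

p = 0.3  # assumed erasure probability for this illustration

# Uniform input X ~ Bern(1/2): the value should agree with 1 - p
print(mutual_information([0.5, 0.5], bec(p)))

# Coarse grid search over X ~ Bern(q) as a stand-in for the maximization in (1.17)
qs = np.linspace(0.01, 0.99, 99)
values = [mutual_information([1 - q, q], bec(p)) for q in qs]
best_q = qs[int(np.argmax(values))]
print(best_q, max(values))
```

On this grid the maximizing parameter is the uniform input, which is consistent with the claim that 1 - p is the capacity of the BEC; a grid search is of course only a sanity check, not a proof.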


6. The binary random variable is named after Jacob Bernoulli, a Swiss mathematician from the 1600s who used
this simple random variable to discover one of the foundational laws in mathematics, the Law of Large Numbers
(LLN). Hence, the binary random variable is commonly referred to as the Bernoulli random variable.
Later, we will have an opportunity to explore the LLN in more detail.


Python exercise Finally we explore how to compute mutual information and
the KL divergence via Python. Using the definition of mutual information together
with the way to compute entropy and conditional entropy (that we learned in
Section 1.2), one can calculate mutual information. Let us do some exercises with
the same example introduced in the previous section: p(x,y) = 1/4, 1/4, 1/3, 1/6 for
(x,y) = (0,0), (0,1), (1,0), (1,1). First we compute I(X;Y) = H(Y) - H(Y|X).


import numpy as np
from scipy.stats import entropy

# Compute p(y)
pY = np.array([1/4+1/3, 1/4+1/6])
# Compute H(Y)
HY = entropy(pY, base=2)

# Compute p(x)
pX = np.array([1/4+1/4, 1/3+1/6])
# Compute p(y|x=0)
pY_x0 = np.array([1/4, 1/4])/pX[0]
# Compute p(y|x=1)
pY_x1 = np.array([1/3, 1/6])/pX[1]
# Compute H(Y|X) = sum_x p(x)*H(Y|X=x)
HY_X = pX[0]*entropy(pY_x0, base=2) \
    + pX[1]*entropy(pY_x1, base=2)

# Compute I(X;Y) = H(Y) - H(Y|X)
IXY = HY - HY_X
print(IXY)


0.020720839623907916 


We also compute I(Y;X) = H(X) - H(X|Y) as a sanity check of the symmetry
property.


# Compute p(x)
pX = np.array([1/4+1/4, 1/3+1/6])
# Compute H(X)
HX = entropy(pX, base=2)

# Compute p(y)
pY = np.array([1/4+1/3, 1/4+1/6])
# Compute p(x|y=0)
pX_y0 = np.array([1/4, 1/3])/pY[0]
# Compute p(x|y=1)
pX_y1 = np.array([1/4, 1/6])/pY[1]
# Compute H(X|Y) = sum_y p(y)*H(X|Y=y)
HX_Y = pY[0]*entropy(pX_y0, base=2) \
    + pY[1]*entropy(pX_y1, base=2)

# Compute I(Y;X) = H(X) - H(X|Y)
IYX = HX - HX_Y
print(IYX)


0.02072083962390825 


Up to numerical precision, we observe that I(X;Y) is equal to I(Y;X).
Using the definition (1.13) of the KL divergence, one can implement it from
scratch.

def kl(p, q):
    # KL(p||q) in bits; p and q are NumPy arrays with strictly positive entries
    return sum(p*np.log2(p/q))


We employ this function to verify the relationship (1.14) between mutual information
and the KL divergence.


# Compute p(x,y)
pXY = np.array([1/4, 1/4, 1/3, 1/6])
# Compute p(x)p(y)
pXpY = np.array([pX[0]*pY[0], pX[0]*pY[1],
                 pX[1]*pY[0], pX[1]*pY[1]])
# Compute KL(pXY||pXpY)
print(kl(pXY, pXpY))


0.020720839623908215 


Below we check that the symmetry property does not hold for the KL divergence. 


print(kl(pXY, pXpY))
print(kl(pXpY, pXY))


0.020720839623908215 
0.020945827042758484 


For computation of the KL divergence, one can alternatively employ the built-in
function rel_entr provided in the scipy.special module. The function rel_entr employs the
natural logarithm instead of log base 2, and it returns an array of the elementwise terms
p(x) ln(p(x)/q(x)). Hence, we need to sum the terms and convert the base.


from scipy.special import rel_entr

kl_builtin = rel_entr(pXY, pXpY)
print(sum(kl_builtin))
# To convert into log base 2
print(sum(kl_builtin)/np.log(2))
print(kl(pXY, pXpY))


0.014362591564146779 
0.020720839623908218 
0.020720839623908215 
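The from-scratch function above divides by q and takes the logarithm of p/q, so it produces nan or a warning as soon as a distribution contains a zero. As a hedged sketch, one way to handle such cases is to lean on rel_entr, which follows the usual conventions 0 ln(0/q) = 0 and p ln(p/0) = +∞ for p > 0. The wrapper name `kl_bits` and the example distributions are assumptions made for illustration.

```python
import numpy as np
from scipy.special import rel_entr

def kl_bits(p, q):
    """KL(p||q) in bits; zero entries are handled by rel_entr's conventions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(rel_entr(p, q)) / np.log(2))

# p puts no mass on the third outcome: that term contributes 0
print(kl_bits([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))

# q puts no mass on an outcome that p can produce: the divergence is infinite
print(kl_bits([0.5, 0.5], [1.0, 0.0]))

# Property 3: the divergence of a distribution from itself is zero
print(kl_bits([0.2, 0.8], [0.2, 0.8]))
```

The second call illustrates why the KL divergence is asymmetric in a strong sense: swapping the arguments there gives a finite value instead.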


Look ahead We have completed our exploration of the key concepts in infor- 
mation theory. Moving forward to the next section, we will begin the process of 
proving the source coding theorem.


Problem Set 1


Prob 1.1 (Bits) In Section 1.1, we learned that bits are a common currency of
information that can represent any type of information source. Consider an image
signal that consists of many pixels. Each pixel is represented by three real values
which indicate the intensity of red, green and blue colors respectively. We
assume that each value is quantized, taking one of 256 equally spaced values in [0, 1),
i.e., 0/256, 1/256, 2/256, ..., 254/256, 255/256. Explain how bits can represent such an image
source.


Prob 1.2 (Digital communication architecture) Draw the digital communi- 
cation architecture that Shannon came up with. Also, point out the digital interface. 


Prob 1.3 (Source coding example) Let S be a discrete random variable with 
the probability distribution: 


$$S = \begin{cases} 1, & \text{w.p. } 0.4; \\ 2, & \text{w.p. } 0.2; \\ 3, & \text{w.p. } 0.2; \\ 4, & \text{w.p. } 0.1; \\ 5, & \text{w.p. } 0.1. \end{cases}$$


Consider a source code that maps S ∈ {1, 2, ..., 5} to a codeword f(S).


(a) Calculate the entropy of the random variable S. 

(b) Using the binary code tree that we learned in Section 1.1, construct a source 
code that minimizes the expected codeword length E [length(f(S))]. 

(c) Compare the expected codeword length of your code with H(S). Which 
one is smaller? Also explain why. 


Prob 1.4 (Channel coding example) Consider a binary erasure channel in
which the input X ∈ {0, 1}, the output Y ∈ {0, 1, erasure}, and Y = X with probability
1 - p and Y = erasure with probability p, where p ∈ [0, 1]. An information theorist
claims that the capacity of the erasure channel is a sole function of p, regardless of
transmission and reception strategies. Is the claim true? Also explain why.


Prob 1.5 (Jensen’s inequality) Suppose that a function f is concave and X is
a discrete random variable. Show that

$$\mathbb{E}[f(X)] \le f(\mathbb{E}[X]).$$

Also identify conditions under which the equality holds.


Prob 1.6 (An empirical estimate of entropy) The table below shows the
frequency of letter usage in a particular sample of English text. Suppose that the
text sample is sufficiently large that the frequencies are precise enough. Then,
one natural way to estimate the probability mass function (pmf) of an English letter
is to employ such frequencies:

$$P(\text{English letter} = \text{A}) \approx \frac{\#\text{ of A's in the text}}{\text{total } \#\text{ of letters in the text}} = 0.0817.$$

Please see below for the estimates of other letters. Using Python, compute the
entropy of an English letter with these estimates.

A | 8.17% || H | 6.09% || O | 7.51% || V | 0.98%
B | 1.49% || I | 6.97% || P | 1.93% || W | 2.36%
C | 2.78% || J | 0.15% || Q | 0.10% || X | 0.15%
D | 4.25% || K | 0.77% || R | 5.99% || Y | 1.97%
E | 12.7% || L | 4.03% || S | 6.33% || Z | 0.07%
F | 2.23% || M | 2.41% || T | 9.06%
G | 2.02% || N | 6.75% || U | 2.76%


Prob 1.7 (Joint entropy) Suppose X and Y are binary random variables with 
P(X = 0) = 0.2 and P(Y = 0) = 0.4. A student claims that the joint distribution 
that maximizes joint entropy H(X, Y) is: 


P(X = 0, Y = 0) = 0.08; 


P(X = 0, Y = 1) = 0.12; 


P(X = 1, Y = 0) = 0.32; 


P(X = 1, Y = 1) = 0.48. 
Prove or disprove it. 


Prob 1.8 (Independence) Suppose that two binary random variables $X_1$ and $X_2$
satisfy:

$$P(X_1 = i_1, X_2 = i_2) = \frac{1}{4}$$

for all possible sequence patterns $(i_1, i_2) \in \{(0,0), (0,1), (1,0), (1,1)\}$. Show that
$X_1$ and $X_2$ are independent and identically distributed (i.i.d.), each taking 0 or 1
with probability 1/2. Also compute $H(X_1, X_2)$ and $H(X_1) + H(X_2)$.


Prob 1.9 (Conditional entropy) Let X and Y be discrete random variables.
(a) Show that

$$H(Y) \ge H(Y|X).$$

Do you think that the above inequality makes intuitive sense? If so,
explain why.
(b) A curious student claims that for any $x \in \mathcal{X}$,

$$H(Y) \ge H(Y|X = x).$$

Either prove or disprove it.


Prob 1.10 (Chain rule & conditional entropy) Let $\{X_i\}$ be a discrete random
process with a joint distribution $p(x_1, x_2, \ldots, x_n)$. In view of the entropy (defined
w.r.t. a single random variable) and the joint entropy (defined w.r.t. two random
variables), a natural way to define the entropy for a random process is as follows:

$$H(X_1, X_2, \ldots, X_n) := \mathbb{E}\left[\log \frac{1}{p(X_1, X_2, \ldots, X_n)}\right].$$

(a) Derive the chain rule for the random process:

$$H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, X_2, \ldots, X_{n-1}).$$

(b) Show that when $X_1$ and $X_2$ are independent,

$$H(X_2|X_1) = H(X_2).$$

Considering the interpretation of conditional entropy that we learned in
Section 1.2, this result makes perfect sense. When $X_1$ has nothing to do
with $X_2$ (statistically speaking, being independent), the uncertainty of $X_2$
remains the same whether or not $X_1$ is revealed.

(c) Let $H(X_2|X_1 = x_1) := \mathbb{E}_{p(x_2|x_1)}\left[\log \frac{1}{p(X_2|x_1)}\right]$. Show that

$$H(X_2|X_1) = \sum_{x_1 \in \mathcal{X}_1} p(x_1) H(X_2|X_1 = x_1)$$

where $\mathcal{X}_1$ indicates the range of $X_1$.


Prob 1.11 (Mutual information) Recall the mutual information that we defined
in Section 1.3:

I(X;Y) := H(Y) - H(Y|X)

where X and Y denote discrete random variables. One interpretation that we made
is that mutual information captures common information between the two random
variables involved, as it represents the overlapped area in the Venn diagram. Consider
another discrete random variable Z.

(a) A curious student claims that if X and Y are independent, then they remain
independent even when Z is given:

I(X;Y) = 0 ⟹ I(X;Y|Z) = 0

where I(X;Y|Z) := H(Y|Z) - H(Y|Z,X). Either prove or disprove it.

(b) A creative student wishes to capture common information across three random
variables. To this end, the student defines triple mutual information
as follows:

I(X;Y;Z) := I(X;Y) - I(X;Y|Z).

And then she/he interprets I(X;Y|Z) as the overlapped area between two
parts, each being reflected in H(X|Z) and H(Y|Z) respectively. With this
interpretation, the student feels happy about her/his definition because in
that way I(X;Y;Z) indeed indicates the overlapped area between three
circles (each reflected in H(X), H(Y), H(Z)). With the faith that the area
must be non-negative, the student claims:

I(X;Y;Z) ≥ 0,

as in the conventional case I(X;Y) ≥ 0. Either prove or disprove it.


Prob 1.12 (Kullback-Leibler divergence) Let p and q be two distributions
on $\mathcal{X}$. Let

$$\mathrm{KL}(p\|q) := \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right].$$

(a) Either prove or disprove that $\mathrm{KL}(p\|q) = \mathrm{KL}(q\|p)$.
(b) Show that $\mathrm{KL}(p\|q) \ge 0$.
(c) Show that the equality in the above inequality in part (b) holds if and only
if p = q.


Prob 1.13 (Mutual information vs. KL divergence) In Section 1.3, we learned
about one specific yet insightful expression that connects mutual information to the
KL divergence:

$$I(X;Y) = \mathrm{KL}(P_{X,Y} \| P_X P_Y)$$

where $P_X(x)$ and $P_Y(y)$ indicate probability distributions of discrete random variables
$X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, respectively; and $P_{X,Y}(x, y)$ denotes the joint distribution.
There is another insightful expression that relates mutual information to
the KL divergence. In this problem, you are asked to establish the expression. The
expression gives insights into GANs (Goodfellow et al., 2014) and fair classifiers
that we will study in Part III.

(a) Let $P_{Y|X=x}(y)$ be the conditional distribution of Y given X = x. Show
that

$$I(X;Y) = \sum_{x \in \mathcal{X}} P_X(x) \mathrm{KL}(P_{Y|X=x} \| P_Y).$$

(b) Suppose X ~ Bern(1/2) and

$$Y = \begin{cases} Y_{\text{real}}, & \text{if } X = 1; \\ Y_{\text{fake}}, & \text{if } X = 0 \end{cases}$$

where $Y_{\text{real}} \in \mathcal{Y}$ and $Y_{\text{fake}} \in \mathcal{Y}$ denote other discrete random variables.
Show that

$$I(X;Y) = \mathrm{JS}(P_{Y_{\text{real}}} \| P_{Y_{\text{fake}}})$$

where $\mathrm{JS}(P_{Y_{\text{real}}} \| P_{Y_{\text{fake}}})$ is the Jensen-Shannon divergence (another well-known
divergence measure in information theory and statistics) defined as:

$$\mathrm{JS}(P_{Y_{\text{real}}} \| P_{Y_{\text{fake}}}) := \frac{1}{2}\mathrm{KL}\left(P_{Y_{\text{real}}} \,\Big\|\, \frac{P_{Y_{\text{real}}} + P_{Y_{\text{fake}}}}{2}\right) + \frac{1}{2}\mathrm{KL}\left(P_{Y_{\text{fake}}} \,\Big\|\, \frac{P_{Y_{\text{real}}} + P_{Y_{\text{fake}}}}{2}\right).$$

Remark: Those who are familiar with GANs may be able to see a connection
between GANs and mutual information. Otherwise, don’t worry. We will
elaborate on the connection in Part III.


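Prob 1.14 (Mutual information expressed in terms of optimization) Suppose
X ~ Bern(1/2) and

$$Y = \begin{cases} Y_{\text{real}}, & \text{if } X = 1; \\ Y_{\text{fake}}, & \text{if } X = 0 \end{cases}$$

where $Y_{\text{real}} \in \mathcal{Y}$ and $Y_{\text{fake}} \in \mathcal{Y}$ denote discrete random variables with distributions
$P_{Y_{\text{real}}}(\cdot)$ and $P_{Y_{\text{fake}}}(\cdot)$, respectively. In Prob 1.13, it was shown that

$$I(X;Y) = \mathrm{JS}(P_{Y_{\text{real}}} \| P_{Y_{\text{fake}}}) = \frac{1}{2}\mathrm{KL}\left(P_{Y_{\text{real}}} \,\Big\|\, \frac{P_{Y_{\text{real}}} + P_{Y_{\text{fake}}}}{2}\right) + \frac{1}{2}\mathrm{KL}\left(P_{Y_{\text{fake}}} \,\Big\|\, \frac{P_{Y_{\text{real}}} + P_{Y_{\text{fake}}}}{2}\right).$$

Show that

$$I(X;Y) = \max_{D(\cdot)} \sum_{y_{\text{real}} \in \mathcal{Y}} \frac{1}{2} P_{Y_{\text{real}}}(y_{\text{real}}) \log D(y_{\text{real}}) + \sum_{y_{\text{fake}} \in \mathcal{Y}} \frac{1}{2} P_{Y_{\text{fake}}}(y_{\text{fake}}) \log\left(1 - D(y_{\text{fake}})\right) + H(X). \tag{1.18}$$

Remark: Those who are familiar with GANs may be able to see a closer connection
between GANs and mutual information. Don’t you see that connection yet? Don’t
worry. You will see more details later in Part III.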


Prob 1.15 (Conditional entropy) Let X and Y be random variables that take
values in finite sets $\mathcal{X}$ and $\mathcal{Y}$ respectively. You are given that H(X) = 11 and
H(Y|X) = H(X|Y). A student claims that $|\mathcal{Y}| > 3$. Either prove or disprove the
claim.


Prob 1.16 (Chain rule) Let $\{X_i\}$ be a discrete random process. A student
claims that

$$H(X_1, \ldots, X_n) \le \frac{1}{n-1} \sum_{j=1}^{n} H(X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n).$$

Prove or disprove this statement.


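Prob 1.17 (A lower bound) Suppose that random variables $X, Y, \tilde{X}, \tilde{Y}$ take values
on the same set, and $\tilde{X}$ and $\tilde{Y}$ are independent. Let $E = \mathbf{1}\{X \neq Y\}$ and
$\tilde{E} = \mathbf{1}\{\tilde{X} \neq \tilde{Y}\}$ where $\mathbf{1}\{\cdot\}$ denotes an indicator function that returns 1 when the
event is true; 0 otherwise. Show that

$$I(X;Y) \ge \mathrm{KL}(P_E \| P_{\tilde{E}}) - \mathrm{KL}(P_X \| P_{\tilde{X}}) - \mathrm{KL}(P_Y \| P_{\tilde{Y}}).$$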


Prob 1.18 (Mutual information in the play-off game) The play-off is a five-game
series that terminates as soon as either team wins three games. Let X be the
random variable that represents the outcome of a play-off between teams A and B;
possible values of X are AAA, BABAB and BBAAA. Let Y be the number of games
played, which ranges from 3 to 5. Assuming that A and B are equally matched and
that the games are independent, calculate I(X;Y).


Prob 1.19 (True or False?) 


(a) Let X be a discrete random variable taking values in an alphabet $\mathcal{X}$ where
$|\mathcal{X}| > 8$. The value of X is revealed to Alice, but not to Bob. Bob wishes to
figure out the value of X by asking Alice questions of the following type:
if X = i, say 0; if X = j, say 1; if X = k, say 2; otherwise, say 3. Then, the
minimum number of questions (on average) required to uncover the value
of X is in between 2H(X) and 2H(X) + 1.

(b) Consider weathers in Daejeon and Seoul. Assume that Daejeon’s weather is 
sunny w.p. 0.3 and cloudy w.p. 0.7; Seoul’s weather is the same as Daejeon’s 
w.p. 0.3, and different w.p. 0.7. Conditioned on Daejeon’s weather being 
cloudy, Seoul’s weather is more predictable than that without any informa- 
tion on Daejeon’s weather. 

(c) In Section 1.3, we learned that conditioning reduces entropy: H(Y|X) ≤
H(Y) for discrete random variables (X, Y). The same argument holds with
regard to mutual information, i.e., conditioning reduces mutual information:
H(Y|Z) - H(Y|Z,X) =: I(X;Y|Z) ≤ I(X;Y) for any discrete
random variable Z.

(d) If X and Y are independent random variables, then I(X;Y) = 0. However,
the converse does not always hold.

(e) Consider triplet mutual information defined w.r.t. three random variables:
I(X;Y;Z) := I(X;Y) - I(X;Y|Z). As in the conventional case I(X;Y) =
I(Y;X), the symmetry holds:

I(X;Y;Z) = I(X;Z;Y) = ··· = I(Z;Y;X).

(f) Let p and q be distributions. Define:

H(p, q) = H(p) + KL(p||q)

where H(p) indicates the entropy of a random variable having the distribution
p. Then, H(p, q) is convex in q.

(g) For any discrete random variables (X, Y, Z),

H(X, Y, Z) + H(Y) ≤ H(X, Y) + H(Y, Z).


(h) Suppose two random variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ are independent, $|\mathcal{X}| =
|\mathcal{Y}| = 2$ and $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Then,

H(X + Y) = H(X) + H(Y).

(i) Let X be a discrete random variable taking values in an alphabet $\mathcal{X}$ where
$|\mathcal{X}| > 6$. The value of X is revealed to Alice, but not to Bob. Bob wishes to
figure out the value of X by asking Alice questions of the following type: if
X = i, say 0; if X = j, say 1; otherwise, say 2. Then, the minimum number
of questions (on average) required to determine the value of X cannot exceed
the entropy H(X).

(j) Suppose X and Y are binary random variables with P(X = 0) = 0.2 and
P(Y = 0) = 0.4. The joint distribution that maximizes the joint entropy
H(X, Y) is:

P(X = 0, Y = 0) = 0.08;
P(X = 0, Y = 1) = 0.12;
P(X = 1, Y = 0) = 0.32;
P(X = 1, Y = 1) = 0.48.

(k) Let X be a discrete random variable taking values in $\mathcal{X}$. For a one-to-one
mapping function f and an arbitrary function g defined on $\mathcal{X}$,

H(f(X)) = H(X|g(X)).

(l) Suppose that X and Y are independent binary random variables. Then,
there exists a distribution p(x, y) such that

H(X + Y) = log 3.


1.4 Source Coding Theorem for i.i.d. Sources (1/3)


Recap In the previous sections, we delved into the fundamental concepts of infor- 
mation theory, including Shannon’s two-stage architecture that involves splitting 
the encoder into a source encoder and a channel encoder. The purpose of this archi- 
tecture is to convert information from different sources into a common currency: 
bits. Shannon’s work resulted in two theorems that describe the efficiency of the two 
encoder blocks and, in turn, limit the amount of information that can be transmit- 
ted over a channel. These theorems are known as the source coding theorem and 
the channel coding theorem. Having established these key concepts, we are now 
prepared to prove the theorems. Over the next five sections, including this one, we 
will focus on proving the source coding theorem. 


Outline The source coding theorem quantifies the minimum number of bits 
needed to represent an information source without losing any information. The 
information source can be composed of various elements, such as dots and lines (as 
in Morse code), English text, speech signals, video signals, or image pixels. There- 
fore, it consists of multiple components. For instance, a text is made up of multiple 
English letters, and speech signals contain multiple points, each indicating the sig- 
nal’s magnitude at a specific time instant. From the perspective of a receiver who is 
unaware of the signal, it can be considered a random signal. As a result, the source
is modeled as a random process comprising random variables. Let {X_i} denote the
random process, where X_i represents a “symbol” in the source coding literature. To
simplify matters, let us begin with a simple scenario in which the X_i’s are independent
and identically distributed (i.i.d.). We denote by X a generic random variable that
represents each individual instance of X_i. The source coding theorem in the i.i.d.
case is stated as follows:


The minimum number of bits required to represent
the i.i.d. source {X_i} per symbol is H(X).


In the present section, we will make an effort to prove this theorem. After com- 
pleting the i.i.d. case, we will extend it to a more realistic non-i.i.d. distribution 
that the source may follow. 


Symbol-by-symbol source encoder Since an information source consists of
multiple components, the input to the source encoder contains multiple symbols. So the
encoder acts on multiple symbols in general. To understand what this means, consider
a concrete example where the source encoder acts on three consecutive symbols: each
output is then a function of those three consecutive symbols. For simplicity, however,
we are going to consider a much simpler case for the time being, in which the encoder
acts on each individual symbol, independently of other symbols. This means that the
encoder produces bits on a symbol-by-symbol basis: a symbol X_1 yields a corresponding
binary string, then another binary string follows independently for the next symbol X_2,
and so on for the subsequent symbols. The reason that we consider this simple yet
restrictive setting is that it contains every key insight needed for generalization.
Building upon the insights that we will obtain from this simple case, we will later
address the general case.

The simple case allows us to simplify notation. First, it suffices to focus on one
symbol, say X. The encoder is nothing but a function of X. Let us denote the
function by C. Please do not confuse it with the same notation that we used
to indicate channel capacity. The reason that we employ the same notation is that
the output C(X) is called a codeword. Let ℓ(X) be the length of the codeword C(X).
For example, consider X ∈ {a, b, c, d} in which C(a) = 0, C(b) = 10, C(c) =
110, C(d) = 111. In this case, ℓ(a) = 1, ℓ(b) = 2, ℓ(c) = 3, ℓ(d) = 3. Note that
ℓ(X) is a function of a random variable X, hence it is also a random variable. So
we are interested in a representative quantity of such a varying quantity, which is the
expected codeword length:

$$\mathbb{E}[\ell(X)] = \sum_{x \in \mathcal{X}} p(x)\ell(x).$$
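To make the expected codeword length concrete, here is a minimal sketch that evaluates it for the example code C(a) = 0, C(b) = 10, C(c) = 110, C(d) = 111. The text does not specify a distribution for this example, so the pmf below is an assumption chosen purely for illustration.

```python
import math

# Example code from the text
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Assumed pmf (hypothetical; not given in the text)
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Codeword lengths l(x)
lengths = {x: len(codeword) for x, codeword in code.items()}

# Expected codeword length: sum over x of p(x) * l(x)
expected_length = sum(pmf[x] * lengths[x] for x in code)

# Entropy H(X) in bits, for comparison
entropy_bits = -sum(p * math.log2(p) for p in pmf.values() if p > 0)

print(lengths)
print(expected_length, entropy_bits)
```

For this particular assumed pmf the expected length coincides with the entropy, because each codeword length equals log2(1/p(x)); for other distributions the two values generally differ.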


An optimization problem The efficiency of the source encoder is well reflected in
the expected codeword length. Hence, we wish to minimize the expected codeword
length. The optimization problem of our interest is then:

$$\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x). \tag{1.19}$$

There are many ways to estimate the distribution p(x) of an information source.
See Prob 1.6 for one way. Assume that p(x) is given. In this case, the ℓ(x)’s are the only
variables that we can optimize over.

Next, consider constraints that the optimization variables ℓ(x) are subject to.
The obvious constraints are: ℓ(x) ≥ 1 and ℓ(x) ∈ ℕ. Are these constraints enough?
No. If they were enough, the solution to this problem would become trivial: the minimum
expected codeword length would be 1, attained by setting ℓ(x) = 1 for all x. But this is
too good to be true. In fact, there is another constraint on ℓ(x), concerning the condition
that a valid code should satisfy.


A naive condition: Non-singularity For the validity of a code, the encoder
function must be one-to-one. The reason is that if it is not one-to-one, there is
no way to reconstruct the input from the output. In the source coding literature,
a code is said to be non-singular if it is a one-to-one mapping. Mathematically, the
non-singularity condition reads:

C(x) = C(x') ⟹ x = x'.

Here is an example that respects this condition:

C(a) = 0; C(b) = 010; C(c) = 01; C(d) = 10. (1.20)

Note that every codeword is distinct, ensuring a one-to-one mapping.

Is this non-singularity condition enough to ensure the validity of a code? Unfortunately,
no. What we care about is a sequence of multiple symbols. What we get in
the output is the sequence of binary strings which corresponds to a concatenation
of such multiple symbols: X_1X_2···X_n ⟹ C(X_1)C(X_2)···C(X_n). Remember that
we assume the symbol-by-symbol encoder; hence we get C(X_1)C(X_2)···C(X_n)
instead of C(X_1X_2···X_n). For a valid code, one should
be able to reconstruct the sequence X_1X_2···X_n of input symbols from the output
C(X_1)C(X_2)···C(X_n). But in the above example (1.20), there is ambiguity in
decoding the sequence of input symbols. Here is a concrete example where one can
see the ambiguity. Suppose that the output sequence reads:

output sequence: 010

Then, what is the corresponding input sequence? One possible input would be “b”
(C(b) = 010). But there are some other patterns that yield the same output: “ca”
(C(c)C(a) = 010) and “ad” (C(a)C(d) = 010). We have multiple candidates
that agree upon the same output. This is problematic because we cannot tell which
input sequence was fed in. In other words, we cannot uniquely figure out the input
sequence.
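The ambiguity can also be checked mechanically. The sketch below enumerates every input sequence whose concatenated codewords reproduce a given bit string; the helper name `parses` is an illustrative assumption, and the recursion is a plain brute-force search rather than an efficient decoder.

```python
def parses(bits, code):
    """Return every symbol sequence whose concatenated codewords equal `bits`."""
    if bits == "":
        return [""]  # exactly one way to parse the empty string
    results = []
    for symbol, codeword in code.items():
        if bits.startswith(codeword):
            # Consume this codeword and parse the remainder recursively
            for rest in parses(bits[len(codeword):], code):
                results.append(symbol + rest)
    return results

# The non-singular code (1.20)
code_120 = {"a": "0", "b": "010", "c": "01", "d": "10"}
print(parses("010", code_120))  # ['ad', 'b', 'ca']
```

All three candidates named in the text appear, confirming that non-singularity of the single-symbol mapping does not rule out ambiguity for sequences.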


A stronger condition: Unique decodability What additional condition do
we need in order to make a code valid? What we need is that for any
encoded bit sequence, there must be no decoding ambiguity; in other words, there
must be only one matching input sequence. This property is called unique decodability.
It is equivalent to the one-to-one mapping constraint w.r.t. the sequence
of source symbols with an arbitrary length. Here is a mathematical expression for
unique decodability: for any n and m,

$$C(x_1)C(x_2)\cdots C(x_n) = C(x_1')C(x_2')\cdots C(x_m') \implies x_1 x_2 \cdots x_n = x_1' x_2' \cdots x_m'.$$


A uniquely decodable example Let us give an example where the unique
decodability condition holds:

C(a) = 10; C(b) = 00; C(c) = 11; C(d) = 110. (1.21)

To verify unique decodability, we can follow these steps. Let’s consider an output
sequence:

output sequence: 10110101111···

First we read a binary string until we find a matching codeword or a codeword which
includes the string in part. In this example, the first read must be 10 because there
is only one corresponding codeword: C(a). The corresponding input is “a”. What
about the next read? An ambiguity arises in the next two bits: 11. We have two
possible candidates: (i) a matching codeword C(c) = 11; and (ii) another codeword
C(d) = 110 which includes the string 11 in part. Here the “11” is either from “c”
or from “d”. It looks like this code is not uniquely decodable. But it is actually
uniquely decodable: we can tell which is the correct one. The way to check is
to look at the future string. What does this mean? Suppose we see one more bit
after “11”, i.e., we read 110. Still there is no way to figure out which is the correct
one. However, suppose we see two more bits after “11”, i.e., we read 1101. We
can then tell which symbol was fed in: “d”. Why? The other possibility “cb”
(C(c)C(b) = 1100) does not agree with 1101, so it is eliminated. We repeat this.
If the input sequence can be decoded in a unique way using this method, then the
code is considered to be uniquely decodable. In fact, one can verify that the above
mapping (1.21) ensures unique decodability.⁷
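For readers who want to check unique decodability by computer, the following is a minimal sketch of the Sardinas-Patterson test. It repeatedly collects "dangling suffixes" (what is left over when one string is a proper prefix of another) and declares the code not uniquely decodable as soon as a dangling suffix is itself a codeword. The function names are illustrative assumptions, and the sketch assumes codewords are given as nonempty strings of 0s and 1s.

```python
def dangling_suffixes(prefixes, words):
    """Nonempty w such that a + w == b for some a in `prefixes` and b in `words`."""
    return {b[len(a):] for a in prefixes for b in words
            if len(b) > len(a) and b.startswith(a)}

def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test on a collection of codeword strings."""
    codewords = list(codewords)
    C = set(codewords)
    if len(C) < len(codewords) or "" in C:
        return False  # repeated (singular) or empty codeword
    current = dangling_suffixes(C, C)  # suffixes left when one codeword prefixes another
    seen = set()
    while current:
        if current & C:
            return False  # a dangling suffix is a codeword: two parses exist
        seen |= current
        nxt = dangling_suffixes(C, current) | dangling_suffixes(current, C)
        current = nxt - seen  # only explore suffixes not examined before
    return True

code_120 = {"a": "0", "b": "010", "c": "01", "d": "10"}
code_121 = {"a": "10", "b": "00", "c": "11", "d": "110"}
print(is_uniquely_decodable(code_120.values()))  # False
print(is_uniquely_decodable(code_121.values()))  # True
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many can ever appear.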


Constraints on ℓ(x) induced by the uniquely decodable property? Recall our
goal: finding constraints on ℓ(x) in the optimization problem (1.19). How can we
translate the unique decodability property into a mathematical constraint in terms of
the ℓ(x)’s? The translation is a bit difficult.

Fortunately, we have some positive news to share. The good news is that there
exists a simpler and indirect method to determine the constraint. This approach
is based on a type of uniquely decodable codes known as prefix-free codes. The
prefix-free code that we will soon discuss has two key features: firstly, it imposes
the same constraint as the one required for uniquely decodable codes (i.e., any
uniquely decodable code must adhere to the prefix-free code constraint); secondly,
it provides an easier method for identifying the constraint that a valid code must
satisfy.


7. There is a rigorous way of checking unique decodability, proposed by Sardinas and Patterson. Please check
Problem 5.27 in (Cover, 1999) for details.

The first feature implies that the prefix-free code constraint is both necessary and
sufficient for a valid uniquely-decodable code. Therefore, it is enough to consider
the prefix-free code when determining the constraint imposed by a valid code. Proving
the first feature is not a straightforward task, but in Prob 2.3, we provide multiple
subproblems that will assist you in proving it with relative ease. Nevertheless,
the proof itself is non-trivial.

To comprehend the second feature, we need to understand the prefix-free code
in more detail. We will first explain what the code is and also mention a significant
advantage it has over non-prefix-free codes that are also uniquely decodable.


Prefix-free codes Let us revisit the example (1.21) discussed earlier, which is
a case of a uniquely decodable code. However, this example highlights a concern
related to decoding complexity, which in turn motivates the use of prefix-free codes.
The issue is that, as we observed in the previous example, decoding the second
input symbol necessitates examining a future string, which implies that decoding is
not immediate. This issue can be further exacerbated, particularly in the worst-case
scenario, where the output sequence is as follows:

1100000000000000000000000000000000000001.

In this case, in order to decode even the first symbol, we have to look far ahead in
the bit sequence.

In the prefix-free code that we will define soon, there is no such complexity issue.
Here is an example of the prefix-free code:

C(a) = 0; C(b) = 10; C(c) = 110; C(d) = 111. (1.22)

One key property of this code is that no codeword is a prefix of any other codeword.
This is why the code is named the prefix-free code. It is evident that the code in
the previous example (1.21) is not prefix-free, despite being uniquely decodable.
One of its codewords serves as a prefix of another codeword (C(c) = 11 is a prefix of
C(d) = 110), resulting in the need to examine future strings while decoding. In the
worst-case scenario, the entire string must be examined to decode even the first input
symbol. On the other hand, prefix-free codes like (1.22) do not have any codewords
that act as prefixes of other codewords. This eliminates any ambiguity in decoding
and avoids any decoding complexity issues, as there is no need to examine future
strings to decode an input. The code can be decoded instantaneously, which is why
prefix-free codes are also known as instantaneous codes.
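The prefix-free property and the resulting instantaneous decoding can be sketched in a few lines. The function names `is_prefix_free` and `decode_instantaneous` and the test bit string are illustrative assumptions; the decoder below is only valid for prefix-free codes with distinct codewords.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)

def decode_instantaneous(bits, code):
    """Decode bit by bit, emitting a symbol as soon as a codeword is complete."""
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # no lookahead needed for a prefix-free code
            symbols.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(symbols)

code_121 = {"a": "10", "b": "00", "c": "11", "d": "110"}  # uniquely decodable only
code_122 = {"a": "0", "b": "10", "c": "110", "d": "111"}  # prefix-free

print(is_prefix_free(code_121), is_prefix_free(code_122))  # False True
print(decode_instantaneous("0101100111", code_122))        # abcad
```

Note that applying this greedy decoder to (1.21) would be incorrect, since it would always emit "c" upon reading 11 and never consider "d"; this is exactly the lookahead issue described above.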
Look ahead Remember the optimization problem (1.19) that we discussed earlier.
The positive news is that: (i) the restriction on ℓ(x) that applies to the prefix-free
code is the same as that imposed by the uniquely-decodable code, and (ii) it is easy
to recognize the constraint caused by the prefix-free code. In the next section, we
will determine the restriction that the prefix-free code property must satisfy. After
that, we will approach the optimization problem.


1.5 Source Coding Theorem for i.i.d. Sources (2/3)


Recap In the preceding section, we made an attempt to prove the source coding
theorem for i.i.d. sources. To begin with, we concentrated on a basic symbol-by-symbol
encoder, where the code operates independently on each symbol, without
any regard for the other symbols. Our aim was to create a code C that minimizes
the expected codeword length E[ℓ(X)]. In order to accomplish this, we framed an
optimization problem:

$$\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x) \quad \text{subject to some constraints on } \ell(x). \tag{1.23}$$

The key to solving the problem is to come up with mathematical constraints on
ℓ(x) that a valid code (fully specified by the unique-decodability property) should
respect.

We acknowledged that deriving the constraints on ℓ(x) for the optimization
problem can be challenging. Therefore, we decided to take a different approach
based on the following facts: (1) the constraints on ℓ(x) that prefix-free codes
(which are a subset of uniquely-decodable codes) satisfy are equivalent to those of
uniquely-decodable codes; and (2) obtaining the mathematical constraints on ℓ(x)
induced by the prefix-free code property is relatively straightforward. We postponed
the proof of the first fact to Prob 2.3.


Outline In this section, we are going to derive the constraint due to the prefix-free 
code property, and will attack the optimization problem (1.23) accordingly. 


Review of prefix-free codes We start by reviewing the prefix-free code example 
introduced in Section 1.4: 


C(a) = 0; C(b) = 10; C(c) = 110; C(d) = 111. (1.24)

No codeword is a prefix of any other codeword. So it is indeed prefix-free. 


From codeword to a binary code tree We present a visual depiction of the
code that can assist us in determining the mathematical constraints on $\ell(x)$ more
easily. This representation is based on the binary code tree, which was introduced
earlier. In a binary code tree, each node (either the root or an internal node) has
two branches. A one-to-one correspondence exists between a code mapping rule
and the representation of a binary code tree.

To draw a binary code tree from a code mapping rule, we begin with the root
node and draw two branches that originate from it. We then label the top branch
with 0 and the bottom branch with 1. We could equally go the other way around: 1
for the top and 0 for the bottom. This is our own choice. We then have two nodes.
Following that, we assign to each node the binary label sequence that represents the
path from the root to that particular node. Specifically, we assign the sequence 0 to
the top node and the sequence 1 to the bottom node. After that, we check if any
codeword matches either of the two binary sequences 0 and 1. Since the codeword
C(a) matches the sequence 0 assigned to the top node, we label the top node with
“a”. A visual representation of this process can be found in Fig. 1.10.

Figure 1.10. The codeword representation via a binary code tree.

We follow a similar process for the bottom node. However, since there is no 
matching codeword, we generate two additional branches from the node and assign 
a label of 0 on the top and 1 on the bottom. We assign the sequence of binary labels 
“10” to the top node and “11” to the bottom node. Since the codeword C(b) is 
identical to “10”, we assign “b” to the top node. There is no matching codeword for 
“11”, so we split the bottom node into two, assigning “110” to the top and “111” 
to the bottom. Finally, we assign “c” to the top and “d” to the bottom. 

Representing the code using a binary code tree helps in identifying a mathematical
constraint on $\ell(x)$ that a prefix-free code should adhere to. The following two
observations are helpful:


Observation #1 The first observation concerns the location of the nodes to which
symbols are assigned. A tree consists of two types of nodes. One is an ending
(terminal) node, which has no further branch. We call such a node a leaf. The other
is a node from which further branches emanate. We call it an internal node. Keeping
these in mind, let us take a look at the binary code tree illustrated in
Fig. 1.10. Notice that all codewords are assigned to leaves only. In other words,
no codeword is assigned to an internal node. Can you see why that is
the case? If a codeword were assigned to an internal node, we would
violate the prefix-free code property, because that codeword would be a prefix of some
other codeword which lives in a leaf below it. So the first observation that one can
make from the prefix-free code is:


Observation #1: Codeword must be a leaf in a binary code tree. 
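
The tree construction and Observation #1 can be sketched in Python as follows. This is only an
illustration: the nested-dictionary representation and the names `build_code_tree` and
`codewords_only_at_leaves` are assumptions made here, not part of the text. Each node is a
dictionary whose keys "0" and "1" lead to its children, and a node carries a "symbol" entry when
a codeword ends there.

```python
def build_code_tree(code):
    """Return a nested-dict binary code tree for a {symbol: codeword} mapping."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})  # follow or create the branch
        node["symbol"] = symbol  # the codeword ends at this node
    return root


def codewords_only_at_leaves(node):
    """True if no node that holds a symbol has any branch below it."""
    children = [node[b] for b in "01" if b in node]
    if "symbol" in node and children:
        return False  # a codeword sits on an internal node
    return all(codewords_only_at_leaves(child) for child in children)
```

For the code in (1.24) the check succeeds, whereas a hypothetical code such as
{"a": "0", "b": "01"} fails it, because "0" would sit on an internal node above "01".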


Observation #2 Let us move on to the second observation that serves to relate
the prefix-free code property to the mathematical constraint on $\ell(x)$. That is,


Observation #2: Codeword can be mapped to a subinterval in [0, 1].


What does this mean? We can illustrate this by adding a new diagram to the binary 
code tree shown in Fig. 1.10. In this diagram, we associate an interval [0, 1] with 
the set of all codewords. A line is drawn through the root of the tree, and the 
midpoint (0.5) of the associated interval is assigned to the point where the line 
intersects with the [0, 1] interval. We then assign a codeword to the subinterval 
[0, 0.5], since there is only one codeword above the root level. However, below the 
root level, there are multiple codewords. To address this, we draw another line on 
the central interior node in the bottom level, and assign the midpoint (0.75) of 
the [0.5, 1] interval to the point where the line intersects with the [0.5, 1] interval. 
We then assign the codeword C(b) to the subinterval [0.5, 0.75]. We continue this 
process until all codewords are assigned to subintervals, resulting in the diagram 
shown in Fig. 1.11. As a result, each codeword is mapped to a unique subinterval 
of [0, 1]: 


$$
C(a) \leftrightarrow [0, 0.5]; \quad
C(b) \leftrightarrow [0.5, 0.75]; \quad
C(c) \leftrightarrow [0.75, 0.875]; \quad
C(d) \leftrightarrow [0.875, 1].
$$

This naturally leads to the following two facts: (1) the size of the subinterval assigned
to $x$ is $2^{-\ell(x)}$; and (2) there is no overlap between subintervals. The second fact
comes from the first observation: an interior node cannot be a codeword. Since
non-overlapping subintervals of [0, 1] have a total size of at most 1, this yields the
following constraint:

$$
\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1. \tag{1.25}
$$

This is called Kraft's inequality.

Figure 1.11. Observation #2: Any codeword can be mapped to a subinterval in [0,1].
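
As a small numerical companion to Observation #2, the sketch below maps each codeword to its
subinterval and evaluates the left-hand side of Kraft's inequality exactly. The function names
are hypothetical, and the intervals are taken half-open, [low, high), which is an assumption made
here so that neighbouring subintervals share no point; the branch labelled 0 is taken as the top
branch, as in the text.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Exact value of the sum of 2^(-l) over the given codeword lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


def codeword_interval(word):
    """Half-open subinterval [low, high) of [0, 1) assigned to a binary codeword."""
    size = Fraction(1, 2 ** len(word))       # subinterval size is 2^(-length)
    low = Fraction(int(word, 2), 2 ** len(word))
    return low, low + size
```

For the code C(a) = 0, C(b) = 10, C(c) = 110, C(d) = 111, the four subintervals tile [0, 1)
without overlap, so the Kraft sum equals exactly 1.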


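An optimization problem Using Kraft's inequality, we can formulate the optimization
problem as:

$$
\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x)
\quad \text{subject to } \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1, \ \ \ell(x) \in \mathbb{N}, \ \ \ell(x) \ge 1.
$$

One can ignore the constraint $\ell(x) \ge 1$. Why? If $\ell(x) = 0$ for some symbol, that
term alone contributes $2^{0} = 1$, and together with the remaining symbols Kraft's inequality
$\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1$ is violated. So the simplified problem reads:

$$
\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x)
\quad \text{subject to } \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1, \ \ \ell(x) \in \mathbb{N}. \tag{1.26}
$$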

Non-convex optimization The optimization problem (1.26) is widely known
for its difficulty in finding a solution. It falls under a category of optimization
problems that are generally considered challenging. To see this, remember one
definition that we introduced in Section 1.2: concave functions. We say
that a function $f$ is concave if $\forall x_1, x_2$ and $\lambda \in [0,1]$,
$\lambda f(x_1) + (1-\lambda) f(x_2) \le f(\lambda x_1 + (1-\lambda) x_2)$. There is another
type of function, defined in a similar yet opposite manner: convex functions. We say that a
function $f$ is convex if $-f$ is concave, i.e., $\forall x_1, x_2$ and $\lambda \in [0, 1]$,

$$
\lambda f(x_1) + (1-\lambda) f(x_2) \ge f(\lambda x_1 + (1-\lambda) x_2). \tag{1.27}
$$

The inequality has the opposite direction compared to the one used in defining
concave functions.

If we consider the optimization problem (1.26) with the concept of convex functions,
we can observe that the objective function is convex in $\ell(x)$. Additionally, the
left-hand side of the inequality constraint (which can be transformed so that the
right-hand side is 0) is $\sum_{x \in \mathcal{X}} 2^{-\ell(x)} - 1$, which is also convex
in $\ell(x)$. An optimization problem is deemed convex when both the objective function and
the left-hand sides of the constraints are convex in the variables (Boyd and Vandenberghe,
2004). Conversely, if an optimization problem contains any non-convex objective
function and/or non-convex functions in the inequalities, it is considered to be
non-convex.

In (1.26), the objective function and the function in the inequality constraint are
both convex. However, when considering the integer constraint $\ell(x) \in \mathbb{N}$, we must
take into account the definition of convexity with respect to a set. A set is considered
convex if any convex combination $\lambda x_1 + (1-\lambda) x_2$, $\lambda \in [0,1]$, of two
points $x_1, x_2$ in the set is also in the set. If not, the set is non-convex. In this case,
$\mathbb{N}$ is a non-convex set. To see this, consider two integer points, 1 and 2, and one
convex combination of them, 1.5. Clearly, 1.5 is not an integer. As a result, the integer
constraint $\ell(x) \in \mathbb{N}$ is non-convex, making the optimization problem non-convex
as well. Problems with non-convex constraints, particularly integer constraints, are
notoriously difficult to solve. As a result, the optimization problem at hand is challenging
and does not admit a simple closed-form solution.
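
For a small alphabet, the integer-constrained problem (1.26) can still be explored by exhaustive
search, which makes the role of the integer constraint concrete. The sketch below is illustrative
only: the function name `best_integer_lengths` and the cap `max_len` on the codeword length are
assumptions introduced here, and the search cost grows exponentially with the alphabet size.

```python
from itertools import product


def best_integer_lengths(p, max_len=6):
    """Minimize sum_x p(x) l(x) over integer lengths 1..max_len obeying Kraft's inequality."""
    best = None
    for lengths in product(range(1, max_len + 1), repeat=len(p)):
        if sum(2.0 ** -l for l in lengths) <= 1.0:  # Kraft's inequality
            cost = sum(px * l for px, l in zip(p, lengths))
            if best is None or cost < best[0]:
                best = (cost, lengths)
    return best  # (minimum expected length, minimizing lengths)
```

For the pmf (1/2, 1/4, 1/8, 1/8), the search returns the lengths (1, 2, 3, 3), which are exactly
those of the prefix-free code C(a) = 0, C(b) = 10, C(c) = 110, C(d) = 111.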


Approximate! To tackle this challenge, Shannon took a different approach. Rec- 
ognizing the difficulty of the problem, he aimed to provide insight into the solution 
rather than solving it exactly. He proposed that while the problem is difficult, it may 
be possible to find an approximate solution that is similar to the exact solution. 
Therefore, his objective was to create upper and lower bounds on the solution that 
are sufficiently close to each other. In other words, Shannon attempted to approx- 
imate the solution of the problem rather than solving it exactly. 


A lower bound The goal of the optimization problem presented in (1.26) is to
minimize the objective function. Therefore, one would expect that a larger search
space for $\ell(x)$ would lead to a smaller or equal solution compared to the exact solution
in the original problem. Shannon was motivated by this insight to expand the
search space in order to obtain a lower bound. One way to achieve this is by removing
a constraint. Unsurprisingly, Shannon chose to remove the integer constraint,
which involves a non-convex set and is the main reason that makes the problem
challenging. Here is the relaxed version of the problem that Shannon formulated:

$$
\mathcal{L} := \min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x)
\quad \text{subject to } \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1. \tag{1.28}
$$


Look ahead The optimization problem we have presented aims to minimize the
expected length of codewords, taking into account Kraft's inequality and the integer
constraint on $\ell(x)$ in the original formulation. However, we relaxed the integer
constraint, which allowed us to convert the non-convex problem into a manageable
convex optimization problem. In the following section, we will solve the convex
optimization problem to obtain a lower bound. Subsequently, we will derive an
upper bound and use both bounds to prove the source coding theorem.


1.6 Source Coding Theorem for i.i.d. Sources (3/3)


Recap In the previous section, we formulated an optimization problem that aims
to minimize the expected codeword length while satisfying both Kraft's inequality
and the integer constraint on $\ell(x)$. However, we encountered a challenge as the
problem is non-convex and generally intractable. To make progress, we followed
Shannon's approach of approximation by deriving lower and upper bounds that
are as close as possible. To obtain a lower bound, we employed a trick of relaxing
constraints and expanding the search space, which involved removing the non-convex
integer constraint. This allowed us to convert the problem into a tractable
convex optimization problem:

$$
\mathcal{L} := \min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x)
\quad \text{subject to } \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1. \tag{1.29}
$$


Outline In this section, we will derive the lower bound. We will also derive an 
upper bound by introducing another trick. Based on the lower and upper bounds, 
we will finally complete the proof of the source coding theorem. 


A lower bound Convex optimization problems, where both the objective function
and constraint functions are convex, have been extensively studied and there
exist numerous methods to solve them. One such method is the Lagrange multiplier
method, which is commonly taught in Calculus courses (Stewart, 2015; Boyd and
Vandenberghe, 2004). The idea behind this method is to introduce a new variable
called the Lagrange multiplier, denoted by $\lambda$, and define a Lagrange function that
involves both the optimization variables $\ell(x)$'s and the Lagrange multiplier:

$$
L(\ell(x), \lambda) = \sum_{x \in \mathcal{X}} p(x)\ell(x) + \lambda \left( \sum_{x \in \mathcal{X}} 2^{-\ell(x)} - 1 \right).
$$

In the canonical form, the number of Lagrange multipliers equals the number of
constraints. In this case, we only have one constraint, so we have one Lagrange multiplier.
The Lagrange function consists of two parts: (i) the objective function, and
(ii) the product of the Lagrange multiplier and the left-hand side of the inequality
constraint.⁸

8. In the canonical form of inequality constraints, the right-hand side reads 0 and the
inequality direction is $\le$.

How does the Lagrange multiplier method work? We take a derivative of the
Lagrange function w.r.t. the optimization variables $\ell(x)$'s. We also take a derivative
w.r.t. the Lagrange multiplier $\lambda$. Setting these derivatives to zero, we get:

$$
\frac{dL(\ell(x), \lambda)}{d\ell(x)} = 0, \qquad \frac{dL(\ell(x), \lambda)}{d\lambda} = 0.
$$

It has been found that the solution can be obtained by solving these equations⁹
under the constraint of $\lambda \ge 0$. However, we will not use this method for two
reasons. Firstly, this method is somewhat complicated and messy. Secondly, it is
not quite intuitive as to why it should work, as there is a deep underlying theorem
called the strong duality theorem (Boyd and Vandenberghe, 2004) which proves
that this approach leads to the optimal solution under convexity constraints. Since
we will not deal with the proof of this theorem, it is reasonable not to take an
approach that relies on it. In Prob 2.4, you will have an opportunity to use the
Lagrange multiplier method to solve the problem.

Rather than following the conventional approach, we will adopt a much simpler
and more intuitive alternative. Before delving into the specifics of the method, let
us streamline the optimization problem even further. It is worth noting that we
can disregard situations where the strict inequality is satisfied:
$\sum_{x \in \mathcal{X}} 2^{-\ell(x)} < 1$. Suppose there exists an optimal solution, say
$\ell^*(x)$, such that the strict inequality holds: $\sum_{x \in \mathcal{X}} 2^{-\ell^*(x)} < 1$.
Then, one can always come up with a better solution, say $\ell'(x)$, such that for some $x_0$,

$$
\ell'(x_0) < \ell^*(x_0); \qquad
\ell'(x) = \ell^*(x) \ \ \forall x \neq x_0; \qquad
\sum_{x \in \mathcal{X}} 2^{-\ell'(x)} = 1.
$$

We reduced $\ell^*(x_0)$ a bit for one particular symbol $x_0$, in an effort to increase
$2^{-\ell^*(x_0)}$ so that we achieve $\sum_{x \in \mathcal{X}} 2^{-\ell'(x)} = 1$. This is indeed
a better solution as it yields a smaller objective value due to $\ell'(x_0) < \ell^*(x_0)$.
This is a contradiction, implying that an optimal solution occurs only when the constraint
holds with equality. Hence, it suffices to consider the equality constraint. The
optimization can then be simplified as:

$$
\mathcal{L} := \min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x)
\quad \text{subject to } \sum_{x \in \mathcal{X}} 2^{-\ell(x)} = 1. \tag{1.30}
$$

9. These are called the KKT conditions in the optimization literature (Karush, 1939; Kuhn and Tucker, 2014).


The approach that we will take is based on a method called “change of variables”.
Let $q(x) = 2^{-\ell(x)}$. Then, the equality constraint becomes
$\sum_{x \in \mathcal{X}} q(x) = 1$, and $\ell(x)$ in the objective function should be replaced
with $\log \frac{1}{q(x)}$:

$$
\mathcal{L} = \min_{q(x)} \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)}
\quad \text{subject to } \sum_{x \in \mathcal{X}} q(x) = 1, \ \ q(x) \ge 0. \tag{1.31}
$$

Observe that the constraint $q(x) \ge 0$ is introduced because $q(x) = 2^{-\ell(x)}$. When
implementing a change of variables, we need to be mindful of any inherent restrictions
on the new variables that were not present in the initial optimization problem.
Now take note that $\sum_{x \in \mathcal{X}} q(x) = 1$. What does this trigger in your memory? A
probability mass function! Remember the axiom that the pmf must satisfy.

The objective function bears a resemblance to a quantity presented earlier, namely
the entropy $H(X)$. We now assert that the solution to the optimization problem is
$H(X)$, and the minimizing function $q^*(x)$ (that minimizes the objective function)
is equal to $p(x)$. Here is the reasoning behind this assertion.

If we subtract $H(X)$ from the objective function, we obtain:

$$
\sum_{x \in \mathcal{X}} p(x) \log \frac{1}{q(x)} - \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}
= \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
= E_p\left[ \log \frac{p(X)}{q(X)} \right].
$$

What does the final term bring to mind? It is the Kullback-Leibler (KL) divergence
that we covered in Section 1.3. By leveraging a crucial fact about the KL
divergence, namely that it is non-negative (verified in Prob 1.12), we can readily
observe that the objective function is never smaller than $H(X)$, that its minimum is
$\mathcal{L} = H(X)$, and that the minimizer is:

$$
q^*(x) = p(x), \quad \text{i.e.,} \quad \ell^*(x) = \log \frac{1}{p(x)}. \tag{1.32}
$$
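
The claim that $q^*(x) = p(x)$ minimizes the objective in (1.31) can be checked numerically with
a short Python sketch. The function names `entropy` and `cross_entropy` are illustrative labels
chosen here, and base-2 logarithms are assumed so that lengths are measured in bits.

```python
import math


def entropy(p):
    """H(X) = sum_x p(x) log2(1/p(x)) in bits."""
    return sum(px * math.log2(1 / px) for px in p if px > 0)


def cross_entropy(p, q):
    """Objective of (1.31): sum_x p(x) log2(1/q(x))."""
    return sum(px * math.log2(1 / qx) for px, qx in zip(p, q) if px > 0)
```

Evaluating `cross_entropy(p, q)` for any pmf `q` other than `p` gives a value above
`entropy(p)`, and the excess is the KL divergence between `p` and `q`.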

An upper bound Let's now shift our focus to the issue of an upper bound. To
generate a lower bound, we expanded the search space. Conversely, what is a natural
approach to generating an upper bound? The answer is to narrow down the search
space. We will adopt a basic method for narrowing the search space: selecting a
specific choice for the optimization variables $\ell(x)$'s.

What is the particular choice we want to make for $\ell(x)$? Choosing $\ell(x)$ randomly
could result in a potentially weak upper bound. Therefore, we need to be cautious
when making our selection. One might speculate that a good choice would be similar
to the optimal solution in the relaxed optimization problem (without the integer
constraint). In this respect, the minimizing function $q^*(x)$ for the relaxed optimization
problem can provide some guidance. Recall the minimizer in that instance:

$$
q^*(x) = p(x).
$$

Since $q(x) := 2^{-\ell(x)}$, in terms of $\ell(x)$, it would be:

$$
\ell^*(x) = \log \frac{1}{p(x)}.
$$

If the $\ell^*(x)$'s were integers, we would be happy, as we could obtain the exact solution to
the original problem. In general, however, the $\ell^*(x)$'s are not necessarily integers. One
natural choice for $\ell(x)$ is then to take an integer which is as close to $\ell^*(x)$ as possible.
So one can think of two options: (i) $\lfloor \log \frac{1}{p(x)} \rfloor$ and
(ii) $\lceil \log \frac{1}{p(x)} \rceil$. Which of the two options would you like to choose? In
reality, the first option is invalid. Why? Consider Kraft's inequality: with the floor,
$2^{-\ell(x)} \ge p(x)$ for every $x$, so $\sum_{x \in \mathcal{X}} 2^{-\ell(x)}$ can exceed 1.
Therefore, the appropriate selection is the second one. With the second choice, and with $L^*$
denoting the solution of the original problem (1.26), we obtain:

$$
L^* \le \sum_{x \in \mathcal{X}} p(x) \left\lceil \log \frac{1}{p(x)} \right\rceil
\le \sum_{x \in \mathcal{X}} p(x) \left( \log \frac{1}{p(x)} + 1 \right)
= H(X) + 1.
$$
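
The choice $\ell(x) = \lceil \log \frac{1}{p(x)} \rceil$ can be examined numerically with the
sketch below. The function names are hypothetical, base-2 logarithms are assumed, and the pmf
must have strictly positive entries.

```python
import math


def entropy(p):
    """H(X) in bits for a pmf with strictly positive entries."""
    return sum(px * math.log2(1 / px) for px in p)


def ceiling_lengths(p):
    """Integer lengths ceil(log2(1/p(x))), option (ii) in the text."""
    return [math.ceil(math.log2(1 / px)) for px in p]


def floor_lengths(p):
    """Integer lengths floor(log2(1/p(x))), option (i) in the text."""
    return [math.floor(math.log2(1 / px)) for px in p]


def expected_length(p, lengths):
    return sum(px * l for px, l in zip(p, lengths))


def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)
```

For the pmf (0.4, 0.3, 0.2, 0.1), the ceiling lengths are (2, 2, 3, 4): they satisfy Kraft's
inequality and give an expected length of 2.4, which lies between $H(X)$ and $H(X) + 1$. The
floor lengths (1, 1, 2, 3) give a Kraft sum of 1.375 and are therefore not achievable by any
prefix-free code.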


Are the bounds tight? In summary, what we can say for $L^*$ is:

$$
H(X) \le L^* \le H(X) + 1.
$$

First take a look at the lower bound. The lower bound is tight when the
$\log \frac{1}{p(x)}$'s are integers (i.e., the $p(x)$'s are integer powers of 2) and therefore
$\ell^*(x)$ can be chosen as $\log \frac{1}{p(x)}$ without violating the integer constraint.
However, this is a special case because in general $p(x)$ is not limited to that particular
type. As for the upper bound $H(X) + 1$, if $H(X)$ is large enough, the gap of 1 would be
negligible. However, if $H(X)$ is comparable to (or much smaller than) 1, the bounds are loose.
For instance, consider a case in which $X$ is a binary random variable with $P(X = 0) = p$ where
$p \ll 1$. In this case, $H(X)$ is close to 0; hence, the bounds tell us almost nothing
about $L^*$.


General source encoder Despite our significant efforts to approximate L*, 
we discovered that the bounds are generally not tight, rendering our efforts use- 
less. However, the methods employed to derive these bounds play a crucial role in 
demonstrating the source coding theorem. Here’s why. 

Our analysis has been limited to a particular scenario in source encoding. We
have focused on a symbol-by-symbol encoder that processes each symbol independently.
However, the encoder can handle an arbitrary length of input sequence to
produce an output. As a result, it can operate on multiple symbols. For instance,
one could take an $n$-length sequence $Z_n := (X_1, X_2, \dots, X_n)$, which is called a
super symbol, and generate an output like $C(X_1, X_2, \dots, X_n)$. The sequence length
$n$ is a design parameter that can be chosen as desired.

Interestingly, if we apply the bounding methods we have learned to this general
situation, we can readily demonstrate the source coding theorem. Let $L_n^*$ be the
minimum expected codeword length concerning the super symbol $Z_n$:

$$
L_n^* = \min_{\ell(z)} \sum_{z \in \mathcal{Z}} p_{Z_n}(z)\ell(z),
$$

where $\mathcal{Z}$ denotes the alphabet of $Z_n$. Applying the same bounding techniques to
$L_n^*$, we get:

$$
H(Z_n) \le L_n^* \le H(Z_n) + 1.
$$

Since the codeword length per symbol is of our interest, what we care about is the
one divided by $n$:

$$
\frac{H(Z_n)}{n} \le \frac{L_n^*}{n} \le \frac{H(Z_n) + 1}{n}. \tag{1.33}
$$

Remember that the length $n$ of the super symbol is our design choice, and hence
we can choose it to be an arbitrary integer. In an extreme case, we can make
it arbitrarily large. This is the way that we achieve the limit. Note that the gap between
the lower and upper bounds is $\frac{1}{n}$, so both bounds coincide with the limit of
$\frac{H(Z_n)}{n}$ as $n$ tends to infinity.

We are almost done with the proof. What remains is to calculate this matching
quantity. Using the chain rule (in its generalized form; check in Prob 1.10), we get:

$$
\begin{aligned}
\frac{H(X_1, X_2, \dots, X_n)}{n}
&= \frac{H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, \dots, X_{n-1})}{n} \\
&\overset{(a)}{=} \frac{H(X_1) + H(X_2) + \cdots + H(X_n)}{n} \\
&\overset{(b)}{=} H(X)
\end{aligned}
$$

where (a) follows from the independence of $(X_1, X_2, \dots, X_n)$ and the fact that
$H(X_2|X_1) = H(X_2)$ when $X_1$ and $X_2$ are independent (check in Prob 1.10(b));
and (b) is due to the fact that $(X_1, X_2, \dots, X_n)$ are identically distributed.
Applying this to (1.33) together with the sandwich theorem, we obtain:

$$
\lim_{n \to \infty} \frac{L_n^*}{n} = H(X).
$$

This proves the source coding theorem for i.i.d. sources: the maximum compression
rate of an information source per symbol is H(X).
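
The effect of the block length $n$ can be seen numerically in the following sketch for a binary
i.i.d. source. It does not compute $L_n^*$ itself; as an assumption made for illustration, it
evaluates the expected length of the specific choice
$\ell(z) = \lceil \log \frac{1}{p_{Z_n}(z)} \rceil$ used for the upper bound, divided by $n$.
The function names are hypothetical, and $0 < p < 1$ is assumed.

```python
import math


def binary_entropy(p):
    """H(X) in bits for a binary source with P(X = a) = p."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


def per_symbol_length(p, n):
    """Expected ceiling-code length per source symbol for blocks of n i.i.d. symbols."""
    total = 0.0
    for k in range(n + 1):  # k = number of symbols in the block that have probability p
        block_prob = p ** k * (1 - p) ** (n - k)
        length = math.ceil(math.log2(1 / block_prob))
        total += math.comb(n, k) * block_prob * length  # comb(n, k) blocks share this length
    return total / n
```

With p = 0.1, the per-symbol length is 1.3 bits for n = 1, far above
$H(X) \approx 0.469$, and it stays within $\frac{1}{n}$ of $H(X)$ as $n$ grows, in line with
(1.33).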


Look ahead The source coding theorem has been proven, showing that optimal 
codes exist that can achieve the limit. The optimization problem was formulated 
and it was shown that the solution to the problem is entropy. However, we did not 
discuss the explicit sequence pattern of the optimal codes, which means that we did 
not address how to design the optimal codes. In the next section, we will delve into 
this issue more deeply.


Problem Set 2


Prob 2.1 (Unique decodability) Consider a binary code tree in Fig. 1.12 for
a symbol $X \in \{a, b, c, d\}$. Is the code uniquely decodable? Also explain why. You
don't need to prove your answer rigorously. A non-rigorous yet simple explanation
based on what we learned in Section 1.4 suffices.

Figure 1.12. Illustration of a binary code tree with four codewords.


Prob 2.2 (Prefix-free codes vs. Kraft's inequality) Consider prefix-free
codes for $X \in \{1, 2, \dots, M\}$ where $M$ is an arbitrary positive integer. Let $\ell(x)$
be the codeword length for $X = x$. In Section 1.5, we showed that such prefix-free
codes satisfy Kraft's inequality:

$$
\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1.
$$

In this problem, you are asked to prove the converse: show that if Kraft's inequality
holds, then there exists a prefix-free code with such $\ell(x)$'s.


Prob 2.3 (Proof of Kraft's inequality for uniquely decodable codes) In
Section 1.5, it was claimed that a direct way to come up with a mathematical constraint
on $\ell(x)$'s that uniquely decodable codes should satisfy is difficult. In this
problem, you are asked to develop the constraint. Consider a symbol-by-symbol
source code in which the code acts on each individual symbol independently,
i.e., an input sequence $X_1 X_2 \cdots X_n$ yields the output $C(X_1)C(X_2) \cdots C(X_n)$,
where $C(X)$ indicates a codeword for a symbol $X$. Each symbol $X$ takes on one
of the following values $1, 2, \dots, M$. Let $\ell(i)$ be the length of codeword $C(i)$ where
$i \in \{1, 2, \dots, M\}$. Suppose that the source code is uniquely decodable. Through
the following subproblems, you are asked to show that Kraft's inequality holds for
the uniquely decodable code:

$$
\sum_{i=1}^{M} 2^{-\ell(i)} \le 1.
$$

Remark: This implies that Kraft's inequality is also necessary, thus proving that
Kraft's inequality (induced by prefix-free codes) is indeed a necessary and sufficient
condition that uniquely decodable codes should satisfy.


(a) For a positive integer $n$, show that

$$
\left( \sum_{i=1}^{M} 2^{-\ell(i)} \right)^{n}
= \sum_{i_1=1}^{M} \sum_{i_2=1}^{M} \cdots \sum_{i_n=1}^{M} 2^{-(\ell(i_1) + \ell(i_2) + \cdots + \ell(i_n))}.
$$

(b) Consider a concatenation of $n$ symbols $X_1 X_2 \cdots X_n = i_1 i_2 \cdots i_n$. What
is the codeword corresponding to such a concatenation? Also what is the
length of the corresponding codeword?

(c) Let $\ell_{\max} := \max_{1 \le i \le M} \ell(i)$ and $\ell_{\min} := \min_{1 \le i \le M} \ell(i)$.
Let $N_\ell$ be the number of $n$-fold concatenated sequences which yield codewords of length
$\ell$. Using parts (a) and (b), show that

$$
\left( \sum_{i=1}^{M} 2^{-\ell(i)} \right)^{n} = \sum_{\ell = n\ell_{\min}}^{n\ell_{\max}} N_\ell \, 2^{-\ell}.
$$

(d) Show that $N_\ell \le 2^{\ell}$.

(e) Using parts (c) and (d), complete the proof of Kraft's inequality.


Prob 2.4 (Convex optimization) Let $X \in \mathcal{X}$ be a discrete random variable
with pmf $p(x)$. In Section 1.6, in the process of proving the source coding theorem,
we considered the following optimization problem:

$$
\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x)\ell(x)
\quad \text{subject to } \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1.
$$

(a) State the definition of a convex set.

(b) Consider a set $A := \{\ell(x) : \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1\}$. Prove that
the set $A$ is convex.

(c) State the definition of convex optimization.

(d) Is the above optimization problem convex? Also explain why.

(e) Using the Lagrange multiplier method that we discussed in Section 1.6,
derive the solution to the optimization problem as well as the minimizer
$\ell^*(x)$.


Prob 2.5 (True or False?)


(a) Consider a discrete random variable $X \in \mathcal{X}$. Suppose a source code w.r.t.
$X$ satisfies:

$$
\sum_{x \in \mathcal{X}} 2^{-\ell(x)} \le 1
$$

where $\ell(x)$ denotes the codeword length w.r.t. $x$. Then, there always exists
a prefix-free code that satisfies the above.


(b) Any codeword of a uniquely decodable code cannot be mapped to an internal
node in the corresponding binary code tree.


(c) Consider a source symbol $X \in \{a, b, c, d\}$. Consider a source code:

C(a) = 0; C(b) = 11; C(c) = 111; C(d) = 010.

This code is uniquely decodable.


(d) Ifthe most probable letter in an alphabet has probability less than 7 any 
prefix-free code will assign a codeword length of at least 2 to that letter. 


(e) In Section 1.5, we learned that the integer set $\mathbb{N} = \{\dots, -2, -1, 0, 1, 2, \dots\}$
is non-convex. On the other hand, the real set $\mathbb{R}$ is convex.


1.7 Source Code Design


Recap In the preceding sections, we established the source coding theorem for 
the i.i.d. source scenario. To gain insights, we first examined a simple but restricted 
context - the symbol-by-symbol encoder, where the source encoder operates on each 
symbol independently. Subsequently, we formulated an optimization problem that 
aims to minimize the expected codeword length in this scenario. However, due to 
the problem’s intractability, we attempted to approximate the solution by deriving 
reasonably tight lower and upper bounds. By applying simple yet powerful bound- 
ing techniques, we were able to derive lower and upper bounds that differ by 1. 

Furthermore, we extended these bounding techniques to a general setting in
which the source encoder operates on multiple symbols of possibly varying lengths,
resulting in the following:

$$
\frac{H(X_1, \dots, X_n)}{n} \le \frac{L_n^*}{n} \le \frac{H(X_1, \dots, X_n) + 1}{n}
$$

where $L_n^*$ indicates the minimum expected codeword length w.r.t. a super symbol
$Z_n := (X_1, \dots, X_n)$. In the limit of $n$, this gives:

$$
\frac{L_n^*}{n} \longrightarrow H(X). \tag{1.34}
$$


Outline The implication of (1.34) is that there are optimal codes that can reach 
the limit, but we haven't discussed the specific sequence patterns of these optimal 
codes. In other words, we haven't explained how to design these optimal codes. This 
issue will be thoroughly investigated in this section. 


Regimes in which one can achieve the limit Recall the setting where we can
achieve the limit, i.e., the setting in which a super symbol is taken while its length
tends to infinity. In order to design codes, we need to specify two things:
the length and the pattern of codewords. The minimizer of the relaxed optimization
problem (see (1.32)) gives insights into the length of optimal codewords. Recall:

$$
\ell^*(X^n) = \log \frac{1}{p(X^n)}
$$

where $X^n$ denotes the shorthand notation of the sequence: $X^n := (X_1, \dots, X_n)$.
Hence, in order to figure out what the optimal length $\ell^*(X^n)$ looks like in the
regime of interest (large $n$), we should take a look at the behavior of $p(X^n)$ in that
regime.


Behavior of $p(X^n)$ for a large value of n We focus on the binary alphabet
case in which $X_i \in \{a, b\}$ and $P(X_i = a) = p$. Later we will consider the general
alphabet case. The sequence consists of a certain combination of two symbols: a's
and b's. For very large $n$, an interesting behavior occurs in two quantities regarding
the sequence. One is the fraction of symbol “a”, i.e., the number of a's
divided by $n$. This is an empirical mean of occurrences of symbol “a”. The second
is its symbol “b” counterpart, the fraction of symbol “b”.

To see this, we compute the probability of observing $X^n$. Since $X^n$ is i.i.d., $p(X^n)$
is simply the product of individual probabilities:

$$
p(X^n) = p(X_1) p(X_2) \cdots p(X_n).
$$

Each individual probability is $p$ or $1 - p$ depending on the value of $X_i$. So the result
would be of the following form:

$$
p(X^n) = p^{\{\# \text{ of } a\text{'s}\}} (1 - p)^{\{\# \text{ of } b\text{'s}\}}.
$$

Consider $\log \frac{1}{p(X^n)}$ $(= \ell^*(X^n))$ that we are interested in:

$$
\log \frac{1}{p(X^n)} = \{\# \text{ of } a\text{'s}\} \cdot \log \frac{1}{p} + \{\# \text{ of } b\text{'s}\} \cdot \log \frac{1}{1-p}.
$$

Dividing by $n$ on both sides, we obtain:

$$
\frac{1}{n} \log \frac{1}{p(X^n)} = \frac{\{\# \text{ of } a\text{'s}\}}{n} \cdot \log \frac{1}{p} + \frac{\{\# \text{ of } b\text{'s}\}}{n} \cdot \log \frac{1}{1-p}. \tag{1.35}
$$

What can we say about this in the limit of $n$? Your intuition says:
$\frac{\{\# \text{ of } a\text{'s}\}}{n} \to p$ as $n \to \infty$. One can naturally expect that as
$n \to \infty$, the fraction would be concentrated around $P(X = a)$. This is indeed the case.
Relying on a well-known theorem, called the Weak Law of Large Numbers (WLLN¹⁰), one can prove
this. Let $Y_i = \mathbf{1}\{X_i = a\}$ where $\mathbf{1}\{\cdot\}$ is an indicator function that
returns 1 when the event $(\cdot)$ is true; 0 otherwise. Consider:

$$
S_n := \frac{Y_1 + Y_2 + \cdots + Y_n}{n}
$$

where $S_n$ indicates the fraction of a's. Obviously it is a sequence indexed by $n$.

Now consider the convergence of the sequence. From Calculus, you learned
about the convergence of sequences, which are deterministic. But the situation is
a bit different here. The reason is that $S_n$ is a random variable (not deterministic).
We need to consider the convergence w.r.t. a random process. There are multiple
types of convergence w.r.t. random processes. The type of convergence that we need
to explore in our problem context is the convergence in probability. In
fact, what the WLLN that we mentioned above says is that


WLLN: $S_n$ converges to $E[Y] = p$ in probability.


Simply speaking, it means that $S_n$ converges to $p$ with high probability (w.h.p.). But
in mathematics, the meaning should be rigorously stated. What this means in
mathematics is that for any $\epsilon > 0$,

$$
P(|S_n - p| \le \epsilon) \longrightarrow 1 \quad \text{as } n \to \infty.
$$

In other words, $S_n$ is within $p \pm \epsilon$ w.h.p.

10. This is the law (discovered by Jacob Bernoulli) mentioned in Section 1.3.
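
The convergence in probability can be observed numerically. Because $n S_n$ follows a binomial
distribution with parameters $n$ and $p$, the probability $P(|S_n - p| \le \epsilon)$ can be
computed exactly for moderate $n$. The sketch below does so; the function name `prob_within` is
an illustrative choice, and floating-point arithmetic is used, so the values are accurate only up
to rounding.

```python
import math


def prob_within(p, n, eps):
    """P(|S_n - p| <= eps), where n * S_n is Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):  # k = number of a's among the n symbols
        if abs(k / n - p) <= eps:
            total += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total
```

For p = 0.3 and $\epsilon$ = 0.05, the probability increases with n over n = 10, 100, 1000 and
exceeds 0.99 at n = 1000.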
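Applying the WLLN to $\frac{\{\# \text{ of } a\text{'s}\}}{n}$ and
$\frac{\{\# \text{ of } b\text{'s}\}}{n}$, we get: for any $\epsilon_1, \epsilon_2 > 0$,

$$
\frac{\{\# \text{ of } a\text{'s}\}}{n} \text{ is within } p \pm \epsilon_1, \qquad
\frac{\{\# \text{ of } b\text{'s}\}}{n} \text{ is within } 1 - p \pm \epsilon_2,
$$

w.h.p. as $n \to \infty$. Applying this to (1.35), we can say that in the limit of $n$,

$$
\frac{1}{n} \log \frac{1}{p(X^n)} = (p \pm \epsilon_1) \log \frac{1}{p} + (1 - p \pm \epsilon_2) \log \frac{1}{1-p}
= H(X) \pm \epsilon
$$

where $\epsilon := \epsilon_1 \log \frac{1}{p} + \epsilon_2 \log \frac{1}{1-p}$.
Manipulating the above, we get:

$$
p(X^n) = 2^{-n(H(X) \pm \epsilon)} \quad \text{holds w.h.p.} \tag{1.36}
$$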

The key observation is that for a very large value of $n$, $\epsilon$ can be made arbitrarily 
close to 0 and thus $p(X^n)$ is almost the same for all such $X^n$. We call such sequences 
typical sequences. The set that contains typical sequences is said to be a typical set, 
formally defined as below. 

$$A_\epsilon^{(n)} := \{x^n : 2^{-n(H(p)+\epsilon)} \le p(x^n) \le 2^{-n(H(p)-\epsilon)}\},$$

where $H(p) = H(X)$ for this binary source. Notice that $X^n$, with $p(X^n) \approx 2^{-nH(p)}$, 
is almost uniformly distributed. This property plays a crucial role in designing an 
optimal code. In the sequel, we will use this property to design the code. 
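As a quick numerical sanity check of (1.36), the following minimal sketch draws i.i.d. binary sequences and compares $\frac{1}{n}\log\frac{1}{p(X^n)}$ with the binary entropy. The value p = 0.3, the random seed, and the helper names `binary_entropy` and `normalized_log_prob` are assumptions made for illustration only; they do not come from the text.

```python
import math
import random


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def normalized_log_prob(seq, p):
    """Return (1/n) * log2(1 / p(x^n)) for an i.i.d. sequence over {'a', 'b'}."""
    n = len(seq)
    num_a = seq.count("a")
    log_prob = num_a * math.log2(p) + (n - num_a) * math.log2(1 - p)
    return -log_prob / n


random.seed(0)  # fixed seed so that the run is repeatable
p = 0.3         # assumed value of P(X = a)
for n in (10, 100, 10000):
    seq = ["a" if random.random() < p else "b" for _ in range(n)]
    # The first printed value should approach the second one as n grows
    print(n, round(normalized_log_prob(seq, p), 4), round(binary_entropy(p), 4))
```

When the fraction of a's in a sequence equals p exactly, the normalized log-probability coincides with H(p); the WLLN says that the fraction is close to p w.h.p. for large n.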
Prefix-free code design What (1.36) suggests is that any arbitrary sequence 
is asymptotically equiprobable. Remember from Section 1.1 that a good source code 
assigns a short codeword to a frequent symbol, while assigning a long 
codeword to a less frequent symbol. For a very large value of $n$, any sequence 
is almost equally probable. This implies that the optimal length of a codeword 
assigned to any arbitrary sequence would be roughly the same: 

$$\ell^*(x^n) = \log\frac{1}{p(x^n)} \approx nH(X).$$


A binary code tree can be constructed where the depth of the tree is approximately 
nH (X), and the codewords are allocated to the leaves. It is worth noting that in a 
prefix-free code, the codewords are located only in the leaves. 

Let us check if we can map all the possible sequence patterns of $X^n$ into leaves. 
To this end, we need to check two values: (1) the total number of leaves; and (2) the 
total number of possible input sequences. First of all, the total number of leaves is 
roughly $2^{nH(X)}$. What about the second value? The total number of input sequence 
patterns is $2^n$ because each symbol can take on one of the two values (“a” and 
“b”) and we have $n$ of them. But in the limit of $n$, the sequence $X^n$ behaves in a 
particular manner, more specifically, in a manner that $p(x^n) \approx 2^{-nH(X)}$; hence, 
the number of such sequences is not the maximum possible value of $2^n$. Then, how 
many sequences satisfy $p(x^n) \approx 2^{-nH(X)}$? To see this, consider: 

$$\sum_{x^n:\, p(x^n) \approx 2^{-nH(X)}} p(x^n) \approx \left|\{x^n : p(x^n) \approx 2^{-nH(X)}\}\right| \times 2^{-nH(X)}.$$

The aggregation of all the probabilities of such sequences cannot exceed 1; hence, 

$$\left|\{x^n : p(x^n) \approx 2^{-nH(X)}\}\right| \lesssim 2^{nH(X)}.$$

Note that the cardinality of the set does not exceed the total number of leaves 
$\approx 2^{nH(X)}$. So by using part of the leaves, we can map all of such sequences. This 
completes the design of an optimal prefix-free code. See Fig. 1.13. 

Finally let us check if this code indeed achieves the fundamental limit $H(X)$. 
Notice that the length of every codeword is $\approx nH(X)$. So the expected codeword 
length per symbol would be $\approx \frac{nH(X)}{n} = H(X)$. 

Remark: The argument for the source code design is based on approximation. 
You will have a chance to do this rigorously in Prob 3.7. 
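The counting argument above can be checked exactly for a binary source, since all sequences with the same number of a's share the same probability. The sketch below is an illustration under assumed values (p = 0.3, ε = 0.05; the function name `typical_set_stats` is hypothetical): it counts the sequences whose normalized log-probability is within ε of the entropy and reports their total probability.

```python
import math


def typical_set_stats(n, p, eps):
    """Return (size, total probability, entropy) of the eps-typical set
    of an i.i.d. binary source with P(X = a) = p."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    size, prob = 0, 0.0
    for k in range(n + 1):  # k = number of a's in the sequence
        rate = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        if abs(rate - h) <= eps:
            count = math.comb(n, k)  # number of sequences with k a's
            size += count
            # Work in the log domain to avoid floating-point underflow
            prob += math.exp(math.log(count) + k * math.log(p)
                             + (n - k) * math.log(1 - p))
    return size, prob, h


for n in (50, 200, 1000):
    size, prob, h = typical_set_stats(n, p=0.3, eps=0.05)
    # log2(size)/n stays below h + eps, far below 1 (all 2^n sequences)
    print(n, round(math.log2(size) / n, 4), round(h + 0.05, 4), round(prob, 4))
```

The set is exponentially smaller than the set of all $2^n$ sequences, yet it carries almost all of the probability once n is large.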
Figure 1.13. Design of optimal prefix-free codes. The binary code tree has approximately $2^{nH(X)}$ leaves, and only part of the leaves is used for mapping codewords. 


Extension to non-binary sources We have considered the binary alphabet 
case. How about non-binary sources? Using the WLLN, one can prove that: 

$$\frac{1}{n}\log\frac{1}{p(X^n)} \to H(X) \quad \text{in prob.}$$

even when $X$ is an arbitrary random variable, not limited to the binary one. Please 
check that this is indeed the case in Prob 3.4. Roughly speaking, this implies that 
$p(X^n) \approx 2^{-nH(X)}$ w.h.p.; that is, any arbitrary sequence is asymptotically equally 
probable. Using this and applying the code design rule based on a binary code tree, we 
can easily construct an optimal source code. Every input sequence is mapped to a 
leaf in a binary code tree with depth $\approx nH(X)$ and hence, the expected codeword 
length per symbol is $H(X)$. The sequence pattern of a codeword is determined by 
which leaf the codeword is assigned to. 


Look ahead We have focused on i.i.d. sources so far. However, in real-world sce- 
narios, many information sources exhibit non-i.i.d. behavior. Therefore, it is cru- 
cial to investigate the source coding theorem and optimal code design for non-i.i.d. 
sources. In the next section, we will delve into these practically-relevant scenarios. 
1.8 Source Coding Theorem for General Sources 


Recap In the previous section, we examined the method for creating an optimal 
code that can achieve the entropy promised by the source coding theorem for the 
i.i.d. source case. The approach utilized a super symbol-based technique, where the 
source encoder operates on a sequence of multiple input symbols ($n$ symbols), and 
the value of $n$ is chosen by the designer. The construction aimed to increase the size 
of the super symbol to a sufficiently large value. As we analyzed the technique for 
large $n$, we discovered using the WLLN that: 

$$\frac{1}{n}\log\frac{1}{p(X^n)} \to H(X) \quad \text{in prob.} \qquad (1.37)$$

i.e., $\frac{1}{n}\log\frac{1}{p(X^n)}$ lies in between $H(X) - \epsilon$ and $H(X) + \epsilon$ for any $\epsilon > 0$, as $n$ tends 
to infinity. Inspired by the fact that the codeword length solution for the lower 
bound in the optimization problem of interest is $\ell^*(x^n) = \log\frac{1}{p(x^n)}$, we focused on 
the quantity $\frac{1}{p(x^n)}$. From (1.37), we observed that in the limit of $n$, the quantity 
becomes: 

$$\log\frac{1}{p(x^n)} \approx nH(X).$$

As a result, we were prompted to investigate a prefix-free code where a given input 
sequence $x^n$ is assigned to a leaf located at the level with a tree depth of approximately 
$nH(X)$. By doing so, we guarantee that the expected codeword length per 
symbol is approximately $H(X)$, meeting the expected limit. Additionally, we verified 
that the number of possible input sequences with probability $p(x^n)$ approximately 
equal to $2^{-nH(X)}$ is lower than the total number of leaves. This guarantees 
a unique mapping for each of these input sequences. 


Outline Our focus was solely on i.i.d. sources. However, when it comes to non- 
i.i.d. sources, one might wonder what the corresponding source coding theorem is, 
as well as how we can design optimal codes for such sources. In this section, we will 
explore these inquiries. 


General non-i.i.d. sources In real-world scenarios, most information sources 
deviate significantly from the i.i.d. assumption. One prime example of such a source 
is an English text. To illustrate this, let’s take the example where the first and second 
letters are “t” and “h” respectively. In this case, it is reasonable to expect that the third 
letter would be “e” due to the high frequency of the word “the” in typical English 
texts. This example highlights the strong correlation between symbols in a text, 
indicating that the sequence is dependent and not i.i.d. Many other information 
sources exhibit similar characteristics, making it imperative to explore the source 
coding theorem and optimal code design for non-i.i.d. sources. 


Source coding theorem for general sources By utilizing the bounding techniques 
we have learned, we can easily address the non-i.i.d. source case. If we 
utilize a super symbol-based source code with a super symbol size of $n$, and let 
$L_n^* = \mathbb{E}[\ell(X^n)]$, we can apply the same lower and upper bound techniques to 
show that: 

$$H(X_1, X_2, \ldots, X_n) \le L_n^* \le H(X_1, X_2, \ldots, X_n) + 1.$$

Dividing the above by $n$, we get: 

$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{L_n^*}{n} \le \frac{H(X_1, X_2, \ldots, X_n) + 1}{n}.$$

The expected codeword length per symbol is related to the following quantity: 

$$\lim_{n\to\infty}\frac{H(X_1, X_2, \ldots, X_n)}{n}.$$

Does the limit exist? If so, we are done. However, this is not always the 
case, as there are certain artificial instances where the limit does not exist, as shown 
on page 75 of (Cover, 1999). Nonetheless, in numerous cases of practical 
importance, the limit does exist. Therefore, let us focus only on those cases where 
the limit exists, in which case we can express the source coding theorem as follows: 

$$\text{Minimum \# of bits that can represent a general source per symbol} = \lim_{n\to\infty}\frac{H(X_1, X_2, \ldots, X_n)}{n}.$$


Figure 1.14. $\lim_{n\to\infty}\frac{H(X_1, X_2, \ldots, X_n)}{n}$ means the growth rate of the sequence uncertainty w.r.t. 
$n$, so it is called the entropy rate. 

There is a term that refers to this limit. In order to comprehend why this term 
is used, take a look at a graph where $n$ and $H(X_1, X_2, \ldots, X_n)$ are represented 
on the x and y axes, respectively. This can be observed in Fig. 1.14. What the above 
limit means is the slope. In other words, it means the growth rate of the sequence 
uncertainty w.r.t. $n$. Hence, it is called the entropy rate. 


Stationary process You might be wondering how to calculate the entropy rate. 
In many cases of practical significance, computing the entropy rate is a straightforward 
task. One such example is a stationary process. A random process is considered 
stationary if $\{X_i\}$ has the same statistical properties (such as joint distribution) 
as its shifted version $\{X_{i+\ell}\}$ for any non-negative integer $\ell$. An example of this is 
English text. The statistics of a 10-year-old text would be almost identical to that 
of a present-day text; for instance, the frequency of the word “the” in an older text 
would be roughly the same as in a contemporary text. 

When the limit is applied to a stationary process, it can be simplified even further. 
To illustrate this, let's consider: 

$$\begin{aligned} H(X_1, X_2, \ldots, X_n) &= H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, \ldots, X_{n-1}) \\ &= \sum_{i=1}^{n} H(X_i|X_1, \ldots, X_{i-1}) \\ &= \sum_{i=1}^{n} H(X_i|X^{i-1}) \end{aligned}$$

where the first equality is due to the chain rule and the last comes from the shorthand 
notation that we introduced earlier: $X^{i-1} := (X_1, \ldots, X_{i-1})$. Let $a_i = 
H(X_i|X^{i-1})$. We can see two properties of the deterministic sequence $\{a_i\}$. First, it 
is non-negative. Second, it is non-increasing, i.e., $a_{i+1} \le a_i$. To see this, consider: 

$$\begin{aligned} a_{i+1} &= H(X_{i+1}|X_1, X_2, \ldots, X_i) \\ &\le H(X_{i+1}|X_2, \ldots, X_i) \\ &= H(X_i|X_1, \ldots, X_{i-1}) \\ &= a_i \end{aligned}$$

where the inequality follows from the fact that conditioning reduces entropy and 
the second last equality is due to the stationarity of the process. These two properties 
imply that the deterministic sequence $\{a_i\}$ has a limit. Why? Please check in 
Prob 3.8. Now consider 

$$\frac{H(X_1, \ldots, X_n)}{n} = \frac{1}{n}\sum_{i=1}^{n} a_i.$$


This quantity is the running average of the sequence $\{a_i\}$. Keeping in 
mind that $\{a_i\}$ has a limit, your intuition says that the running average will 
converge to the same limit, because almost all the components in the running average 
are close to the limit. This is indeed the case. The entropy rate of the stationary 
process is: 

$$H(\mathcal{X}) = \lim_{i\to\infty} H(X_i|X^{i-1}). \qquad (1.38)$$

Please see Prob 3.8 for the rigorous proof. 
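To make the running-average picture concrete, here is a minimal sketch for one particular stationary process: a two-state Markov chain started from its stationary distribution. The transition matrix `P` and the function name `joint_entropy` are assumptions chosen for illustration. The joint entropy is computed by brute-force enumeration, so only small n is practical.

```python
import itertools
import math

# Hypothetical two-state Markov chain: P[i][j] = P(X_{t+1} = j | X_t = i)
P = [[0.9, 0.1],
     [0.4, 0.6]]
# Stationary distribution of a two-state chain (solves pi = pi P)
pi = [P[1][0] / (P[0][1] + P[1][0]), P[0][1] / (P[0][1] + P[1][0])]


def joint_entropy(n):
    """H(X_1, ..., X_n) in bits, by enumerating all 2^n sequences."""
    h = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        prob = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            prob *= P[prev][cur]
        h -= prob * math.log2(prob)
    return h


prev_h = 0.0
for n in range(1, 11):
    h_n = joint_entropy(n)
    a_n = h_n - prev_h  # a_n = H(X_n | X^{n-1}) by the chain rule
    # Columns: n, running average H(X_1..X_n)/n, conditional entropy a_n
    print(n, round(h_n / n, 4), round(a_n, 4))
    prev_h = h_n
```

For this chain, $a_n$ is constant from $n = 2$ onward (the Markov property), and the running average decreases toward that same value, in line with (1.38).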

Next, how do we design optimal codes for such a stationary process? We can apply 
the same methodology that we developed earlier. The first step is to check that such a 
stationary sequence is also asymptotically equiprobable. In fact, one can resort to a 
generalized version of the WLLN to prove that this is indeed the case: 

$$\frac{1}{n}\log\frac{1}{p(X^n)} \to H(\mathcal{X}) \quad \text{in prob.} \qquad (1.39)$$

The proof of this is not that simple, requiring some non-trivial tricks. We will not 
deal with the proof here. If you are interested, you can try it via Prob 3.10. Recall 
what we learned in the previous section: what (1.39) suggests is that the optimal 
code assigns the same codeword length to every sequence, and the length should 
be roughly $nH(\mathcal{X})$. The sequence patterns will be determined by the binary code 
tree of depth $\approx nH(\mathcal{X})$ in which codewords are mapped to leaves. 


From theory to practice Up to this point, we have discussed the limit (the 
entropy rate) on the compression rate of an information source (which 
is almost always a stationary process in practical applications) and learned how to 
design optimal codes that achieve this limit. However, our focus has been on an idealistic 
scenario where the super symbol size $n$ can be made infinitely large. In reality, 
$n$ must be finite for two reasons. Firstly, the hardware used to implement the code 
cannot support an infinite $n$. Secondly, the amount of available information 
from the source is limited. Therefore, the question arises: what are the optimal codes 
for a finite value of $n$? This was the question posed by Shannon and shared with one 
of the professors at MIT, Prof. Robert Fano. Prof. Fano then shared this question 
with students who took the information theory course that he taught at that time. 

As previously mentioned, the optimization problem for the finite $n$ case 
is non-convex and extremely difficult to solve. To date, a closed-form solution to the limit 
remains elusive. 

However, surprisingly, one of Prof. Fano's students, David Huffman, devised a 
simple algorithm that leads to the optimal code. Although he did not provide an 
exact closed-form solution to the limit, he was able to give an explicit design rule in 
the form of an algorithm that generates the optimal code. This code is now known 
as the Huffman code, and it was developed as a term project in the class. 


Look ahead In summary, we have demonstrated the source coding theorem for 
general information sources and discovered how to design codes that achieve the 
limit. Additionally, we provided a brief introduction to some practical codes such 
as the Huffman code. The next section will delve deeper into the Huffman code. 
1.9 Huffman Code and Python Implementation 


Recap We have demonstrated the source coding theorem for general information 
sources (stationary processes in practically-relevant scenarios). We also learned how 
to design optimal codes that attain the limit, which is the entropy rate. Nevertheless, 
our attention has been focused on an idealistic situation in which the super symbol 
size n is infinitely large, while in practice, n must be finite. The good news is that 
an optimal code construction was developed shortly after Shannon established the 
source coding theorem. 


Outline In this section, we will delve into the study of the optimal code, known 
as the Huffman code, which was developed by David Huffman. The inspiration for 
the code came from “thinking outside the box”. Huffman scrutinized an intuitive 
binary code tree and deduced several properties that an optimal binary code tree 
must satisfy. These properties led him to invent a natural algorithm that guarantees 
optimality. We will first examine the crucial properties that an optimal binary code 
tree should possess. Next, we will describe the functioning of the optimal algorithm. 
Finally, we will investigate the implementation of the algorithm using Python. 


An optimization problem for n = 1 Let us start with the simplest case in 
which the super symbol size $n$ is 1. Let a symbol $X$ be an $M$-ary random variable 
with the pmf $p(x)$ in which $x \in \mathcal{X} = \{a_1, a_2, \ldots, a_M\}$. For notational simplicity, 
we will denote $p(a_i)$ by $p_i$ for $i \in \{1, \ldots, M\}$. Let $\ell_i$ be the codeword length w.r.t. 
symbol $a_i$. Then, the optimization problem which aims to minimize the expected 
codeword length is: 

$$\min_{\ell_i} \sum_{i=1}^{M} p_i\ell_i \quad \text{s.t.} \quad \sum_{i=1}^{M} 2^{-\ell_i} \le 1, \;\; \ell_i \in \mathbb{N}. \qquad (1.40)$$

As we learned in Section 1.5, this is a non-convex optimization problem. In particular, 
because it involves an integer constraint, it is a so-called integer program, a class of 
problems known to be notoriously difficult. In general, solving an integer program requires 
searching over all possible candidates for the variables. 
Since the intractability of the problem is well known, in the early days only 
heuristics were suggested. One heuristic proposed by Shannon is to choose $\ell_i$ 
as (Shannon, 2001): 

$$\ell_i = \left\lceil \log\frac{1}{p_i} \right\rceil.$$

People later called this the Shannon code. Without the ceiling, this choice is the 
solution to the relaxed optimization problem which ignores the integer constraint. 
Hence, it yields the exact solution when the $\log\frac{1}{p_i}$'s are integers. In general, the $\log\frac{1}{p_i}$'s are 
not necessarily integers, so the problem is not that simple. Actually, even now, 
a closed-form expression for the minimum expected codeword length is unknown. 
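The Shannon code heuristic is easy to evaluate numerically. The following sketch, using an assumed five-symbol distribution and a hypothetical helper name `shannon_lengths`, computes the lengths $\lceil \log\frac{1}{p_i} \rceil$, checks Kraft's inequality, and compares the expected length with the entropy.

```python
import math


def shannon_lengths(probs):
    """Codeword lengths of the Shannon code: ceil(log2(1 / p_i))."""
    return [math.ceil(math.log2(1 / p)) for p in probs]


probs = [0.4, 0.2, 0.15, 0.15, 0.1]  # assumed example distribution
lengths = shannon_lengths(probs)
kraft_sum = sum(2 ** -l for l in lengths)
expected_length = sum(p * l for p, l in zip(probs, lengths))
entropy = sum(p * math.log2(1 / p) for p in probs)

print(lengths)                     # codeword lengths
print(kraft_sum)                   # at most 1, so a prefix-free code exists
print(expected_length, entropy)    # entropy <= expected length < entropy + 1
```

The Kraft sum is strictly below 1 here, which signals unused leaves in the code tree and hence room for a shorter code.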


Birth of the Huffman code Fano had developed another heuristic code, which 
was later named Shannon-Fano code (Salomon, 2004), but it was not generally 
optimal. He challenged the students in his information theory course at MIT 
to find a solution to the integer programming problem. Unexpectedly, a student 
named David Huffman came up with a simple algorithm that leads to the optimal 
code, which is now known as the Huffman code (Huffman, 1952). 

Instead of attempting to solve the difficult problem directly, Huffman used a 
creative approach to gain insights from the properties that the optimal code must 
have. He studied the binary code tree and established three key properties that the 
optimal binary code tree should satisfy. These properties allowed him to develop 
an optimal algorithm. 


Properties of optimal prefix-free codes Without loss of generality, assume 
that $p_1 \ge p_2 \ge \cdots \ge p_M$. Let $\ell_i^*$'s be optimal codeword lengths. The first property 
concerns the relationship between $p_i$'s and $\ell_i^*$'s: 

$$\text{Property 1:} \quad p_i > p_j \implies \ell_i^* \le \ell_j^*. \qquad (1.41)$$

This is very intuitive: for a more frequent symbol (higher $p_i$), the corresponding 
optimal codeword cannot be longer (smaller $\ell_i^*$). Here is a rigorous proof 
for this seemingly-trivial property. The idea is by contradiction. Suppose $\ell_1^* > \ell_2^*$ 
when $p_1 > p_2$. The optimal expected codeword length reads: 

$$L^* = p_1\ell_1^* + p_2\ell_2^* + \sum_{i=3}^{M} p_i\ell_i^*.$$

On the other hand, consider another expected codeword length, say $\hat{L}$, w.r.t. the 
case in which we interchange the roles of $\ell_1^*$ and $\ell_2^*$: 

$$\hat{L} = p_1\ell_2^* + p_2\ell_1^* + \sum_{i=3}^{M} p_i\ell_i^*.$$

We then see that 

$$L^* - \hat{L} = p_1(\ell_1^* - \ell_2^*) - p_2(\ell_1^* - \ell_2^*) = (p_1 - p_2)(\ell_1^* - \ell_2^*) > 0.$$

This contradicts the hypothesis that $L^*$ is the minimum expected codeword 
length. 

The second property concerns the relationship between the optimal lengths 
w.r.t. the last two least probable symbols: 

$$\text{Property 2:} \quad \ell_{M-1}^* = \ell_M^*. \qquad (1.42)$$

This follows immediately from the fact that each node of a binary code tree has 
only two branches. If we assume that $\ell_{M-1}^* \ne \ell_M^*$, there must exist a vacant 
leaf adjacent to the leaf designated for the least probable symbol $a_M$. This 
contradicts optimality, since a binary code tree with an unoccupied leaf cannot 
produce the most efficient code. To further illustrate this point, let us examine a 
simple instance in which $M$ equals 4, and the codewords are: 

$$C(a_1) = 0, \; C(a_2) = 10, \; C(a_3) = 110, \; C(a_4) = 1110.$$

See Fig. 1.15. In this case, one can immediately find a better codeword that yields 
a shorter length: $C(a_4) = 111$. 

Figure 1.15. An example which supports the second property: $\ell_{M-1}^* = \ell_M^*$. 

The third and last property is about the relationship between the last two least 
probable symbols: 

$$\text{Property 3:} \quad (a_{M-1}, a_M) \text{ can be assumed to be siblings,} \qquad (1.43)$$

meaning that the codewords for $a_{M-1}$ and $a_M$ can always be mapped to leaves 
which are neighbors of each other. Here is the proof. Suppose $(a_{M-1}, a_M)$ are 
not siblings. Then, there must be another symbol, say $a_{M-i}$ ($i \ge 2$), such that 
it is a sibling of $a_M$. Otherwise, one can immediately find a better code (Why?). 
Clearly $\ell_{M-i}^* = \ell_M^*$. This together with Property 2 gives $\ell_{M-i}^* = \ell_{M-1}^*$. Hence, 
one can interchange the codewords of $a_{M-i}$ and $a_{M-1}$ without loss of optimality 
(while keeping the expected codeword length). This ensures $a_{M-1}$ to be a sibling 
of $a_M$. To see this clearly, consider an example in which $M = 5$ and codewords are: 

$$C(a_1) = 0, \; C(a_2) = 100, \; C(a_3) = 110, \; C(a_4) = 101, \; C(a_5) = 111.$$

See Fig. 1.16. In this case, one can interchange the codewords of $a_3$ and $a_4$ so that 
$a_4$ is a sibling of $a_5$ while maintaining the expected codeword length. 

Figure 1.16. An example which supports the third property: two least probable symbols 
$(a_4, a_5)$ can be assumed to be siblings. 


An optimal algorithm The above three properties (1.41)–(1.43) enabled Huffman 
to come up with a simple and natural algorithm. Let us explore it by starting 
with the simplest case $M = 2$. In this case, the second property $\ell_{M-1}^* = \ell_M^*$ yields 
an obvious construction: $C(a_1) = 0$ and $C(a_2) = 1$. 

Next consider $M = 3$. The second property gives $\ell_2^* = \ell_3^*$. Obviously $\ell_2^* = 
\ell_3^* \ge 2$. Otherwise (i.e., $\ell_2^* = \ell_3^* = 1$), $\ell_1^*$ must be greater than or equal to 2. But 
this violates the first property. Hence, $\ell_1^* = 1$ and $\ell_2^* = \ell_3^* \ge 2$; taking the shortest 
admissible lengths $\ell_2^* = \ell_3^* = 2$ yields a 
straightforward construction: assigning 10 and 11 to the two least probable symbols 
while mapping 0 to the most frequent symbol: 

$$C(a_1) = 0, \; C(a_2) = 10, \; C(a_3) = 11. \qquad (1.44)$$

Now consider $M = 4$. With the second and third properties, $\ell_3^* = \ell_4^*$ and 
$(a_3, a_4)$ can be assumed to be siblings. Consider the internal node associated with 
the two leaves w.r.t. $(a_3, a_4)$. Let $a_3'$ be a virtual symbol which represents the two 
symbols $(a_3, a_4)$. Map $a_3'$ to that internal node, and define 

$$p_3' := p(a_3') = p_3 + p_4.$$

Let $\ell_3'$ be the codeword length w.r.t. $a_3'$. This then gives $\ell_3' + 1 = \ell_3^* = \ell_4^*$. 
Consider a new set of symbols $(a_1, a_2, a_3')$ with $(p_1, p_2, p_3')$. The idea of the Huffman 
algorithm is to take a recursion: running the $M = 3$ algorithm (described 
in (1.44)) for the new set of $(a_1, a_2, a_3')$. See Fig. 1.17. 

Figure 1.17. The Huffman algorithm for $M = 4$: the two least probable symbols are merged into a virtual symbol, and the algorithm for $M = 3$ is run on the resulting set. 

The recursive approach outlined here results in the minimum expected codeword 
length. While the verification will be elaborated upon later, a comprehensive 
account of the algorithm for any given $M$ is presented below: 

1. If $M = 3$, we run the algorithm described in (1.44). 

2. Otherwise, merge the two least probable symbols $(a_{M-1}, a_M)$ to generate a 
virtual symbol $a_{M-1}'$ with $p_{M-1}' := p_{M-1} + p_M$. Run the same algorithm 
(performing procedures 1 and 2) for the new set of $M - 1$ symbols: 
$(a_1, a_2, \ldots, a_{M-2}, a_{M-1}')$. 


We also provide an example for $M = 5$: 

$$(p_1, p_2, p_3, p_4, p_5) = (0.4, 0.2, 0.15, 0.15, 0.1). \qquad (1.45)$$

See Fig. 1.18. We first merge the two least probable symbols $(a_4, a_5)$ to generate a 
virtual symbol $a_4'$ with $p_4' = 0.15 + 0.1 = 0.25$. Next we repeat the same 
for the new set of four symbols with $(p_1, p_2, p_3, p_4') = (0.4, 0.2, 0.15, 0.25)$. We 
merge the two least probable symbols $(a_2, a_3)$ to generate a virtual symbol, say $a_2'$, with 
$p_2' = 0.2 + 0.15 = 0.35$. This yields another set of symbols with $(p_1, p_2', p_4') = 
(0.4, 0.35, 0.25)$. Lastly we merge $a_2'$ and $a_4'$ to complete the algorithm. 

Figure 1.18. An example of how the Huffman algorithm runs. 
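The merging procedure can be traced with a few lines of code. The sketch below is not the implementation developed later in this section; it is an illustrative variant (the function name `huffman_lengths` and the use of a heap are assumptions) that only tracks codeword lengths: every merge adds one bit to each symbol underneath the merged node.

```python
import heapq


def huffman_lengths(probs):
    """Codeword lengths obtained by repeatedly merging the two least
    probable entries (ties are broken by insertion order)."""
    # Heap items: (probability, tie-breaker, indices of symbols under the node)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # each merge adds one bit to these symbols
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


probs = [0.4, 0.2, 0.15, 0.15, 0.1]  # the distribution in (1.45)
lengths = huffman_lengths(probs)
print(lengths)
print(sum(p * l for p, l in zip(probs, lengths)))  # expected codeword length
```

Because $p_3 = p_4$, the tie may be broken differently from the text (pairing $a_5$ with $a_3$ rather than $a_4$), but the resulting lengths, and hence the expected codeword length, are the same.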


Proof of the optimality Let us prove the optimality of the Huffman algorithm. 
Let 

$$L = \sum_{i=1}^{M} p_i\ell_i = p_{M-1}\ell_{M-1} + p_M\ell_M + \sum_{i=1}^{M-2} p_i\ell_i.$$

Here $(a_{M-1}, a_M)$ are the two least probable symbols and we merge them to generate 
$a_{M-1}'$ with $p_{M-1}' = p_{M-1} + p_M$. Note that $\ell_{M-1}' + 1 = \ell_{M-1} = \ell_M$. Using this, 
we get: 

$$\begin{aligned} L &= p_{M-1}(\ell_{M-1}' + 1) + p_M(\ell_{M-1}' + 1) + \sum_{i=1}^{M-2} p_i\ell_i \\ &= p_{M-1} + p_M + \left[ p_{M-1}'\ell_{M-1}' + \sum_{i=1}^{M-2} p_i\ell_i \right]. \end{aligned} \qquad (1.46)$$

Let $L'$ be the expected codeword length w.r.t. the new set of symbols 
$(a_1, a_2, \ldots, a_{M-2}, a_{M-1}')$. Then, $L' = p_{M-1}'\ell_{M-1}' + \sum_{i=1}^{M-2} p_i\ell_i$, which coincides 
with the bracketed term in the second equality of (1.46). The key 
observation is that minimizing $L'$ yields the same solution as minimizing 
$L$, as the left-over term $p_{M-1} + p_M$ is irrelevant to the optimization variables $\ell_i$'s. 
Also Kraft's inequality for the new set of $(a_1, a_2, \ldots, a_{M-2}, a_{M-1}')$ holds: 

$$\begin{aligned} 2^{-\ell_{M-1}'} + \sum_{i=1}^{M-2} 2^{-\ell_i} &= 2 \cdot 2^{-(\ell_{M-1}' + 1)} + \sum_{i=1}^{M-2} 2^{-\ell_i} \\ &= 2^{-\ell_{M-1}} + 2^{-\ell_M} + \sum_{i=1}^{M-2} 2^{-\ell_i} \\ &= \sum_{i=1}^{M} 2^{-\ell_i} \le 1. \end{aligned}$$

This proves the optimality of the Huffman algorithm. 


Extension to an arbitrary super symbol size n To give you an idea, let us 
consider the case in which the size of the super symbol is $n = 2$. Let $Z = (X_1, X_2)$. 
Then, the probability distribution w.r.t. $Z$ reads: 

$$p(a_1, a_1), \; p(a_1, a_2), \; \ldots, \; p(a_M, a_M). \qquad (1.47)$$

We can then run the Huffman algorithm w.r.t. $Z$. The generalization to arbitrary 
$n$ is straightforward. Let $Z = (X_1, X_2, \ldots, X_n)$. Then, the probability distribution 
is defined analogously to (1.47). Running the same algorithm, we obtain the optimal code. 
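To see the benefit of larger super symbols, consider the following sketch. It assumes an i.i.d. binary source with an illustrative pmf (0.9, 0.1), so that the joint distribution of a super symbol is a product of marginals; for a general source the joint distribution would have to be supplied directly. The helper names are hypothetical. The expected Huffman codeword length is computed as the sum of all merged probabilities, since each merge adds one bit to every symbol below it.

```python
import heapq
import itertools
import math


def huffman_expected_length(probs):
    """Expected codeword length of a Huffman code for the given pmf,
    computed as the sum of the probabilities of all merged nodes."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


def super_symbol_probs(pmf, n):
    """Joint pmf of n i.i.d. symbols (product of marginals)."""
    return [math.prod(ps) for ps in itertools.product(pmf, repeat=n)]


pmf = [0.9, 0.1]  # assumed i.i.d. binary source
entropy = sum(p * math.log2(1 / p) for p in pmf)
for n in (1, 2, 3, 4):
    per_symbol = huffman_expected_length(super_symbol_probs(pmf, n)) / n
    # per-symbol length lies between the entropy and entropy + 1/n
    print(n, round(per_symbol, 4), round(entropy, 4))
```

With $n = 1$ the code needs one bit per symbol even though the entropy is below 0.5 bits; larger super symbols narrow this gap at the cost of an alphabet of size $M^n$.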


Python implementation We explore how to implement the Huffman algorithm 
in Python. For illustrative purposes, consider a simple setup where $M = 5$ 
(say a, b, c, d, e) and the probability distribution is the same as in (1.45): 

$$(p_1, p_2, p_3, p_4, p_5) = (0.4, 0.2, 0.15, 0.15, 0.1). \qquad (1.48)$$

Since the algorithm is based on a binary tree, we start with constructing a binary 
tree class with the top and bottom branches. 

class TreeNode(object):
    def __init__(self, top=None, bottom=None):
        # children attached to the top and bottom branches
        self.top = top
        self.bottom = bottom

    def children(self):
        return self.top, self.bottom
Figure 1.19. A binary code tree constructed by the Huffman algorithm. 


This class is equipped with a method (named children) that returns the values 
assigned to the top and bottom branches. 

Using this tree class, we wish to construct a binary code tree as illustrated in 
Fig. 1.19. We first create a tree (that we named Node 1) by merging the two 
least probable symbols (d, e). To Node 1, we then assign a new probability $p_4' = 
0.15 + 0.1$. Next we repeat the same for the new set of four symbols (a, Node 1, b, c) 
with $(p_1, p_4', p_2, p_3)$. We build another tree, Node 2, by merging the two least probable 
symbols (b, c) in the new set. For another set of symbols (a, Node 2, Node 1) 
with $(p_1, p_2', p_4') = (0.4, 0.35, 0.25)$, we create Node 3, taking Node 2 and Node 1 
as the top and bottom children. Lastly we construct Node 4 (the root node) to 
complete the algorithm. 

For code implementation, we represent the probability distribution via a dictionary 
where the keys and values are symbols (or TreeNode) and probabilities, respectively. 
To merge the two least probable symbols, we sort the dictionary items in descending 
order of probability and take the last two. To this end, we employ the built-in sorted 
function, which returns a list of (key, probability) pairs. Here is the 
code implementation. 


# Probability distribution
freq = {"a": 0.4, "b": 0.2, "c": 0.15, "d": 0.15, "e": 0.1}

# Sort in descending order (yields a list of (key, probability) pairs)
freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)

# Construct a binary code tree
while len(freq) > 1:
    # Retrieve the minimum probability
    (key_b, c_b) = freq[-1]
    # Retrieve the 2nd minimum probability
    (key_t, c_t) = freq[-2]
    # Build a TreeNode taking minimum and 2nd minimum
    # as the bottom and the top child, respectively
    new_node = TreeNode(key_t, key_b)
    # Construct a new probability distribution with the
    # remaining probabilities & that of the new TreeNode
    freq = freq[:-2]
    freq.append((new_node, c_t + c_b))
    # Sort in descending order
    freq = sorted(freq, key=lambda x: x[1], reverse=True)
    print(freq)


[('a', 0.4),
 (<__main__.TreeNode object at 0x000001344B434250>, 0.25),
 ('b', 0.2), ('c', 0.15)]
[('a', 0.4),
 (<__main__.TreeNode object at 0x000001344B434A30>, 0.35),
 (<__main__.TreeNode object at 0x000001344B434250>, 0.25)]
[(<__main__.TreeNode object at 0x000001344B434580>, 0.6),
 ('a', 0.4)]
[(<__main__.TreeNode object at 0x000001344B4343A0>, 1.0)]


The tree construction stops when the newly updated list contains only one 
element. In the above example, we have four steps in total. To see how each step 
works, we print out the updated list at every step. We also illustrate it via Fig. 1.20. 

Step 1: (a, 0.4) (Node1, 0.25) (b, 0.2) (c, 0.15), where Node1 has children (d, e) 
Step 2: (a, 0.4) (Node2, 0.35) (Node1, 0.25), where Node2 has children (b, c) 
Step 3: (Node3, 0.6) (a, 0.4), where Node3 has children (Node2, Node1) 
Step 4: (Node4, 1), where Node4 has children (Node3, a) 

Figure 1.20. How each step works in the Huffman algorithm. 
Next we construct an encoding function for the Huffman code. In the 
above example, we wish to generate a dictionary as below: 

{b : 000, c : 001, d : 010, e : 011, a : 1}. 

To this end, we need to traverse all the leaves while producing a corresponding 
binary string w.r.t. each leaf. In the data structure literature, there are two major 
methods for traversing a binary tree: (i) Depth-First Search (DFS); 
and (ii) Breadth-First Search (BFS). Here we take DFS, which can easily be implemented 
via recursion. Here is the code implementation. 

def huffman_code_tree(node, binString=''):
    # returns a dictionary where
    # (key, value) = (symbol, codeword)

    # if node is of string type (a leaf), return {node: binString}
    if type(node) is str:
        return {node: binString}
    (top, bottom) = node.children()
    # initialize a dictionary for encoding
    d = dict()
    # top child: assign a label '0'
    d.update(huffman_code_tree(top, binString + '0'))
    # bottom child: assign a label '1'
    d.update(huffman_code_tree(bottom, binString + '1'))
    return d

enc_dict = huffman_code_tree(freq[0][0])
print(enc_dict)


{'b': '000', 'c': '001', 'd': '010', 'e': '011', 'a': '1'}


Notice that the resultant dictionary matches the desired codewords illustrated in 
Fig. 1.19. 
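Once the encoding dictionary is available, encoding and decoding a message is straightforward. The sketch below hard-codes the dictionary printed above so that it runs on its own; the function names `encode` and `decode` and the sample message are illustrative assumptions. Decoding relies on the prefix-free property: a codeword is recognized as soon as the buffered bits match a dictionary entry.

```python
# Encoding dictionary obtained above (hard-coded to keep the sketch standalone)
enc_dict = {"b": "000", "c": "001", "d": "010", "e": "011", "a": "1"}


def encode(message, enc_dict):
    """Concatenate the codewords of the symbols in the message."""
    return "".join(enc_dict[symbol] for symbol in message)


def decode(bits, enc_dict):
    """Decode a bit string; no codeword is a prefix of another,
    so the first match in the buffer is always the right one."""
    dec_dict = {code: symbol for symbol, code in enc_dict.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in dec_dict:
            symbols.append(dec_dict[buffer])
            buffer = ""
    return "".join(symbols)


bits = encode("abacade", enc_dict)
print(bits)
print(decode(bits, enc_dict))
```

The frequent symbol a costs a single bit, so a message dominated by a's is encoded with far fewer than three bits per symbol.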


Limitations of the Huffman code Although the Huffman algorithm provides 
an optimal code in practical scenarios with a finite number of source elements, 
its implementation has certain limitations. One limitation is the requirement for 
knowledge of source statistics to design the Huffman code, which means that the 
joint distribution of a super symbol, $p(x_1, x_2, \ldots, x_n)$, must be known. However, 
obtaining this statistical knowledge may not be straightforward in practice. Therefore, 
a pertinent question that arises in this practical scenario is: Is it possible to 
find optimal codes that do not rely on prior knowledge of source statistics? 


Lempel-Ziv code (Ziv and Lempel, 1977, 1978) Lempel and Ziv answered the 
question of whether there are optimal 
codes that do not rely on prior knowledge of source statistics. They developed a 
universal code called the Lempel-Ziv code, which can be applied to any type of 
information source without requiring statistical knowledge. The code has generated 
significant interest from system designers because it can approach the limit, i.e., the 
entropy rate. The Lempel-Ziv code has been implemented in various systems under 
different names, such as gzip, pkzip, and UNIX compress. 

The basic idea behind the Lempel-Ziv code is straightforward and is something 
that people use in their daily lives. To illustrate the idea, let us consider the context 
of closing email phrases. People often use similar phrases such as “I look forward 
to your reply,” “I look forward to seeing you,” “I look forward to hearing from 
you,” and “Your prompt reply would be appreciated.” If we were to encode such an 
English phrase letter by letter, it would require more bits than the number of letters 
in the phrase. However, the Lempel-Ziv code suggests that we can compress the 
phrase much more effectively using a dictionary. Since people use only a few closing 
phrases, we can create a dictionary that maps each phrase to an index like 1, 2, 3. 
This dictionary serves as the basis for the Lempel-Ziv code. 

The Lempel-Ziv code operates in the following way: Firstly, a dictionary is cre- 
ated from a pilot sequence that forms part of the entire sequence. This dictionary 
is then shared between the source encoder and decoder. The source encoder only
encodes the indices, while the decoder decodes the indices using the shared dictionary.
By following this process, it is not necessary to have knowledge of the statistics
of the information source. Moreover, it has been demonstrated that this code can
approach the limit, i.e., the entropy rate, as the dictionary size increases. However,
since the focus of this book is not on this code, we will refrain from delving further
into its details.

[Figure: a list of closing email phrases (e.g., “I look forward to your reply.”), with dictionary indices assigned to them.]

Figure 1.21. The idea of the Lempel-Ziv code. We construct a dictionary (e.g., assigning
indices, marked in blue, for frequently used email phrases) using only part (pilot
sequence) of the entire sequence. We then share the dictionary between source encoder
and decoder. The decoder receives the indices and recovers the original sequence using
the shared dictionary. This code is shown to achieve the limit (the entropy rate) as the
dictionary size increases.
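
The dictionary idea described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the pilot sequence and the message are hypothetical, whole phrases are used as dictionary entries, the raw text is assumed to cost 8 bits per character, and every phrase in the message is assumed to appear in the pilot. The function names `build_dictionary`, `encode` and `decode` are made up for this sketch; practical Lempel-Ziv implementations build their dictionaries adaptively from substrings rather than from whole phrases.

```python
import math


def build_dictionary(pilot_phrases):
    """Assign indices 1, 2, 3, ... to the distinct phrases seen in the pilot sequence."""
    dictionary = {}
    for phrase in pilot_phrases:
        if phrase not in dictionary:
            dictionary[phrase] = len(dictionary) + 1
    return dictionary


def encode(phrases, dictionary):
    """Source encoder: replace each phrase by its dictionary index."""
    return [dictionary[phrase] for phrase in phrases]


def decode(indices, dictionary):
    """Source decoder: look each index up in the shared dictionary."""
    inverse = {index: phrase for phrase, index in dictionary.items()}
    return [inverse[index] for index in indices]


# Hypothetical pilot sequence and message, chosen only for illustration.
pilot = [
    "I look forward to your reply.",
    "I look forward to seeing you.",
    "Your prompt reply is very much appreciated.",
    "I look forward to your reply.",
]
message = [pilot[1], pilot[0], pilot[2], pilot[1]]

dictionary = build_dictionary(pilot)
indices = encode(message, dictionary)
recovered = decode(indices, dictionary)

# Fixed-length index cost versus an assumed 8 bits per character for raw text.
bits_per_index = max(1, math.ceil(math.log2(len(dictionary))))
compressed_bits = bits_per_index * len(indices)
raw_bits = sum(8 * len(phrase) for phrase in message)
print(indices, compressed_bits, raw_bits)
```

With only three distinct phrases in the dictionary, each index costs 2 bits, which is far fewer than the bits needed to spell out a phrase character by character.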


Look ahead Thus far, we have demonstrated the source coding theorem for gen- 
eral information sources and acquired knowledge about optimal code design. Addi- 
tionally, we have gained insight into constructing an explicit optimal code for finite 
n, which is the Huffman code, along with its implementation using Python. These
constitute the contents of Part I. In the ensuing section, we will commence with
Part II.
Problem Set 3


Prob 3.1 (Markov’s inequality) Consider a non-negative random variable X 
and a positive value a > 0. 


(a) Show that $a \cdot \mathbf{1}\{X \geq a\} \leq X$ where $\mathbf{1}\{\cdot\}$ denotes an indicator function.
(b) Using part (a), prove Markov’s inequality:

$$P(X \geq a) \leq \frac{E[X]}{a}.$$


Prob 3.2 (Chebychev inequality) Let X be a discrete random variable with
mean $\mu$ and variance $\sigma^2 < \infty$. Show that for any $t > 0$,

$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}.$$


Prob 3.3 (Weak Law of Large Numbers) Let $\{Y_i\}$ be an i.i.d. discrete random
process with mean $\mu$ and variance $\sigma^2 < \infty$.


(a) State the Weak Law of Large Numbers (WLLN) w.r.t. $\{Y_i\}$.
(b) Prove the WLLN.


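Prob 3.4 (Typical sequences) Let $\{X_i\}$ be an i.i.d. random process where $X_i \in \mathcal{X}$.
Let $X^n := (X_1, X_2, \ldots, X_n)$. In Section 1.7, we claimed the following for an
arbitrary alphabet case $|\mathcal{X}| = M$:

$$\frac{1}{n} \log \frac{1}{p(X^n)} \xrightarrow{\text{in prob.}} H(X). \qquad (1.49)$$

(a) Explain the meaning of the convergence in prob. in (1.49).
(b) Prove that (1.49) holds indeed.
(c) We say that a sequence $x^n$ is $\epsilon$-typical if it satisfies: for $\epsilon > 0$,

$$2^{-n(H(X)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X)-\epsilon)}.$$

Define a typical set that contains the typical sequences as elements:

$$A_\epsilon^{(n)} := \{x^n : 2^{-n(H(X)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X)-\epsilon)}\}. \qquad (1.50)$$

Show that for any $\epsilon > 0$,

$$P(A_\epsilon^{(n)}) := P(X^n \in A_\epsilon^{(n)}) \to 1 \text{ as } n \to \infty.$$

(d) Show that for any $\epsilon > 0$, $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$.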
Prob 3.5 (An example of $\epsilon$-typical sequences) Let $\{X_i\}$ be an i.i.d. ternary
random process:

$$X_i = \begin{cases} 1, & \text{w.p. } p; \\ 2, & \text{w.p. } q; \\ 3, & \text{w.p. } 1 - p - q. \end{cases}$$

We say that a sequence $x^n$ is $\epsilon$-typical if it satisfies: for $\epsilon > 0$,

$$2^{-n(H(X)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X)-\epsilon)}. \qquad (1.51)$$

Fix $\epsilon = 0.01$ and $(p, q) = (1/2, 1/4)$. Consider a sequence $x^n$ such that

$$\frac{\text{\# of 1's}}{n} = p + 0.05; \qquad \frac{\text{\# of 2's}}{n} = q - 0.03; \qquad \frac{\text{\# of 3's}}{n} = 1 - p - q - 0.02.$$

Is $x^n$ $\epsilon$-typical? Also explain why.


Prob 3.6 (A choice of codeword length) Let $\{X_i\}$ be an i.i.d. random process
where $X_i \in \mathcal{X}$. Fix $\epsilon > 0$. In Section 1.7, we derived the following with the help
of the WLLN:

$$n(H(X) - \epsilon) \leq \log \frac{1}{p(X^n)} \leq n(H(X) + \epsilon) \quad \text{w.h.p.} \qquad (1.52)$$

where w.h.p. stands for “with high probability”. This motivated us to set the length
of every codeword to:

$$\ell(x^n) = \lceil n(H(X) + \epsilon) \rceil \quad \forall x^n \in A_\epsilon^{(n)} \qquad (1.53)$$

where $A_\epsilon^{(n)}$ is a typical set defined as:

$$A_\epsilon^{(n)} := \{x^n : 2^{-n(H(X)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X)-\epsilon)}\}. \qquad (1.54)$$

We also verified the validity of this choice by demonstrating that the number of
leaves that codewords can be mapped to in the design of a prefix-free code is greater
than or equal to $|A_\epsilon^{(n)}|$.

On the other hand, one may suggest another choice:

$$\ell(x^n) = \lceil n(H(X) - \epsilon) \rceil \quad \forall x^n \in A_\epsilon^{(n)}. \qquad (1.55)$$

In this problem, you are asked to prove that this choice is invalid.

(a) Fix $\epsilon > 0$. Show that for sufficiently large $n$,

$$\sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \geq 1 - \epsilon. \qquad (1.56)$$

(b) Fix $\epsilon > 0$. Show that for sufficiently large $n$,

$$|A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X)-\epsilon)}. \qquad (1.57)$$

(c) Using part (b) or otherwise, show that the choice (1.55) on the codeword
length is invalid in the limit of $n$.


Prob 3.7 (Source code design for an i.i.d. source) Let $\{X_i\}$ be an i.i.d. random
process where $X_i \in \mathcal{X}$. Fix $\epsilon > 0$. In this problem, you are asked to construct
a super symbol-based code of size $n$ that achieves the expected codeword length (per
symbol) of $H(X) + \epsilon$ in the limit of $n$. Notice that it can achieve the fundamental
limit $H(X)$ (promised by the source coding theorem) with $\epsilon \to 0$.

(a) Let $A_\epsilon^{(n)}$ be a typical set defined as (1.54). Consider a prefix-free code $C$ such
that the first bit of $C(x^n)$ is assigned 0 if $x^n \in A_\epsilon^{(n)}$; assigned 1 otherwise.
The pattern of the remaining bits of $C(x^n)$ is specified by a binary code
tree constructed as follows. To the internal node associated with the first
upper branch (labeled 0), we attach a full subtree of depth $\lceil n(H(X) + \epsilon) \rceil$
so that the total number of leaves in the top subtree is $2^{\lceil n(H(X)+\epsilon) \rceil}$. If
$x^n \in A_\epsilon^{(n)}$, we map the typical sequence into one of the leaves in the top subtree.
Different sequences are assigned distinct leaves.

To the other internal node associated with the first lower branch
(labeled 1), we attach another full subtree of depth $\lceil n \log |\mathcal{X}| \rceil$ so that the
total number of leaves in the bottom subtree is $2^{\lceil n \log |\mathcal{X}| \rceil}$. If $x^n \notin A_\epsilon^{(n)}$,
we map the non-typical sequence into one of the leaves in the bottom subtree.
Show that this prefix-free code is valid, i.e., there are enough leaves for
mapping all possible sequences $x^n$.

(b) Suppose we employ the prefix-free code in part (a). Show that

$$\lim_{n \to \infty} \frac{E[\ell(X^n)]}{n} = H(X) + \epsilon$$

where $\ell(X^n)$ indicates the length of $C(X^n)$.


Prob 3.8 (Convergence of a non-increasing & non-negative sequence)
Consider a deterministic sequence $\{a_i\}$ satisfying the following two properties:
(i) $a_i \geq 0$ and (ii) $a_i \geq a_{i+1}$.

(a) State the definition of the existence of a limit.
(b) Show that $a_i$ converges to a limit: $\lim_{i \to \infty} a_i \ (=: a)$.
(c) Show that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} a_i = a.$$


Prob 3.9 (Source code design for a generalized Markov process) Let
$\{X_i\}$ be a generalized Markov process with two memory states:

$$p(s_n | s_{n-1}, s_{n-2}, \ldots) = p(s_n | s_{n-1})$$

where $s_n := (X_{n-1}, X_n)$. Suppose that

$$P(S_n = 00 | S_{n-1} = 00) = P(S_n = 11 | S_{n-1} = 11) = p,$$

$$P(S_n = 10 | S_{n-1} = 01) = P(S_n = 01 | S_{n-1} = 10) = 0.5.$$

(a) Show that the minimum number of bits that represent the above information
source per symbol is

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}.$$

In Section 1.8, we introduced a terminology for the quantity $H(\mathcal{X})$. What
is it? Also explain the rationale behind the naming.

(b) Using part (a), show that $H(\mathcal{X}) = H(X_3 | X_1, X_2)$. Also compute
$H(X_3 | X_1, X_2)$.

(c) Define a typical set:

$$A_\epsilon^{(n)} := \{x^n : 2^{-n(H(X_3|X_1,X_2)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X_3|X_1,X_2)-\epsilon)}\}.$$

It has been verified that for any $\epsilon > 0$,

$$P(A_\epsilon^{(n)}) := P(X^n \in A_\epsilon^{(n)}) \to 1 \text{ as } n \to \infty.$$

Using this together with part (b), construct a prefix-free code (i.e., draw
a binary code tree) in which the expected codeword length (per symbol)
approaches $H(X_3 | X_1, X_2)$ as $n \to \infty$. Also show that the expected codeword
length of your code indeed achieves the limit $H(X_3 | X_1, X_2)$.


Prob 3.10 (WLLN of a stationary process) Let $\{X_i\}$ be a stationary process.
Suppose that $\mu := E[X_i]$; $a_k = E[(X_i - \mu)(X_{i+k} - \mu)]$, $\forall k \in \mathbb{N}$; and

$$\sum_{k} a_k < \infty.$$

(a) Prove that

$$\frac{X_1 + X_2 + \cdots + X_n}{n} \to \mu \text{ in prob.} \qquad (1.58)$$

(b) As mentioned in Section 1.8, (1.58) is a generalized version of the Weak
Law of Large Numbers (WLLN). Using this, show that

$$\frac{1}{n} \log \frac{1}{p(X^n)} \to H(\mathcal{X}) \text{ in prob.}$$

Here $X^n := (X_1, X_2, \ldots, X_n)$.


Prob 3.11 (Statistics of optimally compressed output) Consider an information
source $S_1, S_2, \ldots, S_n$. Suppose we use an optimal source code that maps
$(S_1, S_2, \ldots, S_n)$ into a binary string $(b_1, b_2, \ldots, b_m)$. Compute $H(b_1, b_2, \ldots, b_m)$.
Hint: You may want to assume that $n$ is sufficiently large.


Prob 3.12 (Entropy rate) Let X = $\{X_n, n \in \mathbb{Z}\}$ be a stationary sequence of
random variables taking values in a finite alphabet $\mathcal{X}$. Define conditional entropy:


1 
Ay (X) := yf Gan <- -o X1 | XL - -3 X1). 


The confused information theorist claims that 


$$\lim_{n \to \infty} H_n(X) = H(X)$$


where H(X) denotes the entropy rate of the source. Prove or disprove this claim. 


Prob 3.13 (Entropy rate and Markov chain) The standard nomenclature 
designates the four quadrants of the Euclidean plane as NE, NW, SE, and SW 
(i.e., northeast, northwest, southeast, and southwest). The particle travels between 
these quadrants, moving equiprobably either vertically or horizontally at each time, 
which means that it switches quadrants with equal probability. For example, if it 
currently occupies the NE quadrant, there is an equal probability that it will be 
in the SE quadrant or the NW quadrant at the next time instant. The particle's 
movement is denoted by the standard directional labels N, S, W, or E, with the 
move from NE to SE labeled as S and the move from SW to SE labeled as E, for 
instance. At time 0, the particle is equally likely to be in any of the quadrants. The
process $\{X_n, n \geq 0\}$ defines the quadrant in which the particle is located at time $n$, whereas
$\{Y_n, n \geq 0\}$ provides the label for the move made by the particle from time $n$ to
time $n + 1$.

(a) Is $\{X_n, n \geq 0\}$ a Markov chain? Either prove or disprove it.
(b) Is $\{X_n, n \geq 0\}$ stationary? Either prove or disprove it.
(c) Find the mutual information rate between $\{X_n, n \geq 0\}$ and $\{Y_n, n \geq 0\}$:

$$\lim_{L \to \infty} \frac{1}{L} I(X^L; Y^L).$$

(d) Is $\{Y_n, n \geq 0\}$ a Markov chain? Either prove or disprove it.
(e) Is $\{Y_n, n \geq 0\}$ stationary? Either prove or disprove it.


Prob 3.14 (Huffman code construction) Consider a random variable with
the probability distribution:

$$X = \begin{pmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 \\ 0.5 & 0.25 & 0.2 & 0.02 & 0.01 & 0.01 & 0.01 \end{pmatrix}.$$

(a) Construct a binary Huffman code for X.
(b) Construct a ternary Huffman code for X.


Prob 3.15 (Optimality of the Huffman code) Let $X \in \{a_1, a_2, \ldots, a_M\}$ be a
discrete random variable where $M > 3$ is an integer. Let $p_i$ be a pmf of X where
$i \in \{1, 2, \ldots, M\}$. Without loss of generality, assume that $p_1 \geq p_2 \geq \cdots \geq p_M$.
Consider a source code which takes X as an input. Let $\ell_i$ be the codeword length
w.r.t. $a_i$. Describe the Huffman algorithm. Prove the optimality of the algorithm.


Prob 3.16 (Principles of the Huffman code) A student claims that if the most 
probable letter in an alphabet has probability less than 7 then any Huffman code 
will assign a codeword length of at least 2 to that letter. Prove or disprove the claim. 


Prob 3.17 (Principles of the Huffman code) Let X be a random variable
taking values in the finite set {1, 2, 3}. Let $\ell(X)$ denote the expected length of an
optimal Huffman code for X. A student claims that if $P(X = i) > 0$ for all
$i = 1, 2, 3$, then $\ell(X) - H(X) > 3 - \log 3$. Either prove or disprove this statement.


Prob 3.18 (True or False?) 


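(a) Let $\{X_i\}$ be a random process. We say that $X_n$ converges to $X$ in probability
if for any $\epsilon > 0$,

$$P(|X_n - X| < \epsilon) \to 1 \text{ as } n \to \infty.$$

(b) For a discrete random variable X and $\epsilon > 0$, consider a typical set:

$$A_\epsilon^{(n)} := \{x^n : 2^{-n(H(X)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X)-\epsilon)}\}.$$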
Then,
Aa a,


(c) Let $\{X_i\}$ be an i.i.d. sequence of binary random variables, each equally likely
to be 0 or 1. Define


Y, := |{j:1 <j < am, X = 1}, n=1,2,...


Then, the entropy rate of the process $\{Y_n\}$ is 1 bit/symbol.


(d) Consider discrete random variables X and Y. Suppose that X and Y are
independent. Then, the expected codeword length of a Huffman code for
the pair (X, Y) is at least as large as the sum of the expected codeword
lengths of individual Huffman codes for X and Y.


(e) Let $X \in \mathcal{X}$ be a discrete random variable with pmf $p(x)$. In Section 1.5, we
consider the following optimization problem:

$$\min_{\ell(x)} \sum_{x \in \mathcal{X}} p(x) \ell(x) : \quad \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \leq 1, \quad \ell(x) \in \mathbb{N}.$$

The closed form solution to the problem has been open thus far. Hence, no
one knows how to construct an optimal $\ell^*(x)$ that minimizes the objective
function.


DOI: 10.1561/9781638281153.ch2


2.1 Statement of Channel Coding Theorem


Recap In Part I, we delved into Shannon’s communication architecture consist- 
ing of two stages. The first stage, source coding, aims to convert an information 
source, which may be of a different type, into bits, a common information cur- 
rency. The second stage, channel coding, converts these bits into a signal that can 
be transmitted reliably over a channel to the receiver. We explored the source cod- 
ing theorem, which determines the maximum compression rate of an information 
source, and learned how to design optimal codes that achieve this limit. Specifically, 
we investigated the Huffman code and practiced implementing it in Python. 

Moving onto Part II, we shift our focus to the second block in Shannon’s archi- 
tecture. Here, we will examine the channel coding theorem, which delineates the 
fundamental limit on the number of bits that can be transmitted reliably over a 
channel. 


Outline Recall the statement we made in Section 1.1 regarding the channel coding
theorem. Similar to the laws of physics, there is a fundamental law in communication
systems: the maximum number of bits that can be transmitted over a channel is
determined, regardless of any operations performed at the transmitter and receiver.
This means that there is a fundamental limit on the number of bits below which
communication is possible and beyond which communication becomes impossible,
no matter what we do. The limit is known as the channel capacity. This
section aims to provide a more precise understanding of this statement by examining
(i) a specific problem scenario that Shannon investigated, (ii) a mathematical
representation of channels, and (iii) the mathematical definition of communication
that is possible or impossible. Once we have a clear understanding of the statement,
we will proceed to prove the theorem.


Problem setup The objective of communication is to transmit as many bits
(a binary string) as possible to the receiver. Our investigation begins
with analyzing the statistics of the binary string. Unlike the information source in
the context of source coding, it is possible to assume simple statistics for the binary
string. In the source coding scenario, the information source has arbitrary statistics
that depend on the application of interest, and source code design is customized
accordingly. In contrast, in channel coding, there is some good news: the statistics
of the input binary string are not arbitrary. Under a reasonable assumption, the
string follows a particular distribution. This reasonable assumption is that we use
an optimal source code.

To determine the specific statistics, we can use the source coding theorem. Suppose
the binary string $(b_1, b_2, \ldots, b_m)$ is the output of an optimal source encoder.
In that case, we can apply the source coding theorem to obtain:

$$m = H(\text{information source}) = H(b_1, b_2, \ldots, b_m) \qquad (2.1)$$

where the second equality follows from the fact that a source encoder is a one-to-one
mapping and the fact that $H(X) = H(f(X))$ for a one-to-one mapping function $f$.
Why? Now observe that


$$\begin{aligned} H(b_1, b_2, \ldots, b_m) &= H(b_1) + H(b_2 | b_1) + \cdots + H(b_m | b_1, \ldots, b_{m-1}) \\ &\leq H(b_1) + H(b_2) + \cdots + H(b_m) \\ &\leq m \end{aligned} \qquad (2.2)$$

where the first inequality is due to the fact that conditioning reduces entropy
and the last inequality comes from the fact that the $b_i$'s are binary random variables.
This together with (2.1) suggests that the inequalities in (2.2) are tight.
This implies that the $b_i$'s are independent (from the first inequality) and identically
distributed ~ Bern($\frac{1}{2}$) (from the second inequality). Hence, we can make the
following assumption: a binary string, an input to a channel encoder, is i.i.d., each
bit being distributed according to Bern($\frac{1}{2}$).

[Figure: block diagram with a channel encoder and a channel decoder $g$.]

Figure 2.1. Channel coding problem setup.
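
The tightness argument around (2.2) can be checked numerically on small examples. The sketch below is only an illustration: the helper names `entropy` and `joint_and_marginal_entropies` and the three example distributions for $m = 2$ are assumptions of this sketch, not part of the text. It computes the joint entropy and the sum of the marginal entropies, and only the i.i.d. Bern($\frac{1}{2}$) case reaches $m$ bits.

```python
import math


def entropy(pmf):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def joint_and_marginal_entropies(joint):
    """joint maps bit tuples (b1, ..., bm) to probabilities.

    Returns (H(b1, ..., bm), H(b1) + ... + H(bm)).
    """
    m = len(next(iter(joint)))
    h_joint = entropy(joint.values())
    h_marginals = 0.0
    for i in range(m):
        p_one = sum(p for bits, p in joint.items() if bits[i] == 1)
        h_marginals += entropy([p_one, 1 - p_one])
    return h_joint, h_marginals


# i.i.d. Bern(1/2) bits: both inequalities in (2.2) are tight.
uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Fully correlated bits: the first inequality is strict.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
# Independent but biased bits (P(b = 1) = 0.1): the second inequality is strict.
biased = {(0, 0): 0.81, (0, 1): 0.09, (1, 0): 0.09, (1, 1): 0.01}

for name, joint in [("uniform", uniform), ("correlated", correlated), ("biased", biased)]:
    h_joint, h_marginals = joint_and_marginal_entropies(joint)
    print(name, round(h_joint, 3), round(h_marginals, 3))
```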

For notational simplicity, information theorists introduced a simple notation
which indicates the i.i.d. random process $\{b_i\}$. They expressed it with a single random
variable, say $W$, using the following mapping rule:

$$\begin{aligned} (b_1, \ldots, b_m) = (0, 0, \ldots, 0) &\longrightarrow W = 1; \\ (b_1, \ldots, b_m) = (0, 0, \ldots, 1) &\longrightarrow W = 2; \\ &\;\;\vdots \\ (b_1, \ldots, b_m) = (1, 1, \ldots, 1) &\longrightarrow W = 2^m. \end{aligned}$$

This allows us to express the input simply with W having the uniform distribution.
Why uniform?

What to design The design of the digital communication system involves
designing two components: (i) a channel encoder, say $f$; and (ii) a channel decoder,
say $g$. See Fig. 2.1. One crucial consideration in the design process is the channel’s
behavior. As previously mentioned, the channel is an adversary that introduces
errors, turning the system into a random function. To combat these errors, the
encoder and decoder must be designed to provide protection. One way to accomplish
this is by adding redundancy, which is akin to repeating the transmission when
communication fails. Building on this idea, we can represent the channel encoder’s
output as a sequence of symbols $(X_1, X_2, \ldots, X_n)$, where $n$ is the code length. This
$n$ is distinct from the super-symbol size utilized in the source coding context. We
use the shorthand notation $X^n := (X_1, X_2, \ldots, X_n)$ to represent this sequence. We
denote by $Y^n$ the channel’s output, and the decoder $g$’s objective is to infer $W$ from
$Y^n$. The challenge here is that $Y^n$ is not a deterministic function of $X^n$. If the channel
were deterministic, $g$ would be the inverse function of the composition of $f$
and the channel. However, as the channel is not deterministic in reality, designing
$g$ is not straightforward. In fact, understanding the channel’s behavior is critical in
designing $g$. Therefore, before delving into details on $f$ and $g$, we must briefly discuss
how the channel behaves and how to model it.


Channel modeling Let $X_i$ and $Y_i$ be the input and the output of the channel
at time $i$, respectively. One typical way to capture the randomness of the channel
is via the conditional distribution $p(y|x)$. To give you an idea, consider one concrete
example: the binary erasure channel (BEC). In the BEC, $Y_i$ takes “erasure” (garbage)
with probability, say $p$ (called the erasure probability), while taking the input $X_i$
cleanly otherwise. So the relation between $X_i$ and $Y_i$ is given by:

$$Y_i = \begin{cases} X_i, & \text{w.p. } 1 - p; \\ e, & \text{w.p. } p \end{cases}$$

where $e$ stands for “erasure”. Usually we consider a memoryless channel in which
this relationship is independent across different time instants.
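
A memoryless BEC is straightforward to simulate. The following is a minimal sketch: the function name `bec`, the erasure symbol `"e"`, the seed, and the chosen values $n = 10000$ and $p = 0.3$ are assumptions of this illustration. Each input bit is erased independently with probability $p$ and passed through unchanged otherwise.

```python
import random

ERASURE = "e"


def bec(x, p, rng):
    """Pass the bit sequence x through a memoryless BEC with erasure probability p."""
    return [ERASURE if rng.random() < p else bit for bit in x]


rng = random.Random(0)  # fixed seed so the run is reproducible
p = 0.3
x = [rng.randint(0, 1) for _ in range(10000)]
y = bec(x, p, rng)

# Every symbol is either erased or equal to the corresponding input bit.
clean = all(out == ERASURE or out == inp for inp, out in zip(x, y))
erased_fraction = sum(1 for out in y if out == ERASURE) / len(y)
print(clean, erased_fraction)
```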


Two performance metrics Let us investigate how to design $f$ and $g$ such that
communication is possible. To this end, we first need to understand what
possible communication means. To explain this, we will introduce two performance
metrics.

The first performance metric captures the number of bits that we wish to transmit.
Suppose $W \in \{1, 2, \ldots, M\}$. Then, the total number of bits that we intend to
send is $\log M$. Since the code length is $n$, the number of channel uses (time instants)
is $n$ and hence, the number of bits transmitted per channel use (called the data
rate) is

$$R = \frac{\log M}{n} \text{ bits/channel use.}$$

The second performance metric concerns the decoding quality. Since the channel is
random, we cannot always guarantee the decoded message (say $\hat{W}$) to be the same
as the original message $W$. Hence, the decoding quality can be captured by the
probability of an error event:

$$P_e := P(\hat{W} \neq W).$$

The smaller $P_e$, the better the decoding quality.


Optimization problem Shannon devised an optimization problem using the
two performance metrics and introduced the concept of possible communication,
which will be explained shortly. Let us first take a look at the optimization problem.
One can expect a tradeoff relationship between $R$ and $P_e$: the larger $R$, the
larger $P_e$, and vice versa. So one natural optimization problem is the
following. Given $R$ and $n$:

$$P_e^* := \min_{f, g} P_e.$$

What Shannon realized is that unfortunately, this is again a non-convex optimization
problem, which is very difficult to solve. Shannon took a different approach, as he did
for the source coding theorem: to approximate. In other words, he attempted
to develop upper and lower bounds on $P_e^*$.

In the process, he discovered something interesting, which then led to the concept
of possible communication. What he found is that when $R$ is below a certain
threshold, an upper bound on $P_e^*$ can be made arbitrarily close to 0 as $n$ increases.
This implies that in this case, the actual $P_e^*$ can be made very small. He also found
that when $R$ is above the threshold, a lower bound on $P_e^*$ cannot be made arbitrarily
close to 0 no matter what we do. This implies that the actual $P_e^*$ cannot be made
arbitrarily small.

This observation led him to come up with the concept of possible communication.
We say that communication is possible if one can make $P_e^*$ arbitrarily close to
0 as $n \to \infty$; otherwise communication is said to be impossible. He also came up
with the related concept of achievable rate. We say that data rate $R$ is achievable if
we can make $P_e^* \to 0$ as $n \to \infty$ given $R$; otherwise $R$ is said to be not achievable.


Channel coding theorem Moreover, Shannon made an interesting observation.
What he observed is that the threshold below which communication is possible
and above which communication is impossible is sharp. In other words, there
is a sharp phase transition on the achievable rate. He then called the limit channel
capacity and denoted it by $C$. This forms the channel coding theorem:


The maximum achievable rate is channel capacity C.


From the next section onwards, we will prove this theorem. Specifically we will
prove the following two:

$$R < C \implies P_e \to 0$$
$$R > C \implies P_e \nrightarrow 0.$$

The first is called the achievability proof or the direct proof. The second is called
the converse proof. Why converse? Notice that the contraposition of the second
statement is:

$$P_e \to 0 \implies R \leq C,$$

which is the reverse direction of the first statement.
Look ahead In the upcoming section, we will prove the achievability for a simple
example. By deriving insights from this example, we will subsequently prove
the achievability for a broader class of channels, determined by an arbitrary
conditional distribution $p(y|x)$.


2.2 Achievability Proof for the Binary Erasure Channel


Recap In the preceding section, we presented the framework for the channel coding
problem. We formally stated the channel coding theorem, which was initially
introduced in a vague manner at the start of this book. Two performance metrics
were considered: (i) the data rate $R$, defined as $\frac{\log M}{n}$ (bits/channel use); and (ii)
the probability of error $P_e$, defined as $P(\hat{W} \neq W)$. Here, $W$ belongs to the set
$\{1, 2, \ldots, M\}$, and $n$ represents the code length (i.e., the number of channel uses).
To investigate the tradeoff between these two metrics, we formulated the following
optimization problem. Given $R$ and $n$:

$$P_e^* = \min_{f, g} P_e$$

where $f$ and $g$ indicate channel encoder and decoder respectively. Unfortunately,
Shannon could not solve this problem. Instead he looked into upper and lower
bounds. In the process, he made an interesting observation. If $R$ is less than a threshold,
say $C$, then $P_e$ can be made arbitrarily close to 0 as $n \to \infty$; otherwise, $P_e \nrightarrow 0$
no matter what we do. This leads to the natural concept of achievable rate. The data
rate $R$ is said to be achievable if we can make $P_e \to 0$ for a given $R$. This finally
formed the channel coding theorem:


Maximum achievable rate = C.


The channel coding theorem requires the proof of two parts. The first is:
$R < C \implies P_e \to 0$. To this end, we need to come up with an achievable scheme which
yields $P_e \to 0$ given $R < C$. This is called the achievability proof. The second
part to prove is: $R > C \implies P_e \nrightarrow 0$. The contraposition of this statement is:
$P_e \to 0 \implies R \leq C$, which is the reverse direction of the statement for the first part.
Hence, it is called the converse proof.


Outline In this section, our goal is to prove the achievability of the channel coding 
theorem. To achieve this, we will examine a simple example of a channel known 
as the binary erasure channel (BEC). This example provides valuable insights into 
proving the achievability of more complex channels. Once we have acquired enough 
insights from the BEC, we will apply them to a fairly general channel model setting. 


Binary erasure channel (BEC) Remember that the channel output reads:

$$Y_i = \begin{cases} X_i, & \text{w.p. } 1 - p; \\ e, & \text{w.p. } p \end{cases}$$

where $e$ stands for “erasure” (garbage) and $p$ indicates the erasure probability. As
mentioned earlier, we consider a memoryless channel in which erasures are independent
across different time instances.

First of all, let us guess what the channel capacity C is. One can make a naive 
guess based on the following observation. The channel is perfect w.p. 1 — p; erased 
w.p. p; and the erasure (garbage) does not provide any information w.r.t. what we 
transmit. This naturally leads to: C is at most 1 — p. Then, a question arises: Is 
this achievable? Think about one extreme scenario in which the transmitter knows 
all erasure patterns across all time instances beforehand. Of course this is far from 
reality. But for simplicity, consider this case for the time being. In this case, one 
can readily achieve 1 — p. The achievable scheme is to transmit a bit whenever the 
channel is perfect. Since the perfect channel probability is 1 — p, by the WLLN, 
we can achieve 1 — p as n tends to infinity. 
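
The WLLN claim for this idealized scheme can be illustrated with a short simulation. The sketch below rests on assumptions: the function name `known_pattern_rate`, the seed, and the chosen values of $n$ and $p$ are hypothetical. The transmitter is assumed to know the erasure pattern in advance and sends one fresh bit on every clean channel use, so the rate equals the fraction of clean channel uses.

```python
import random


def known_pattern_rate(n, p, rng):
    """Bits delivered per channel use when a bit is sent only on non-erased uses."""
    erased = [rng.random() < p for _ in range(n)]
    delivered = sum(1 for is_erased in erased if not is_erased)
    return delivered / n


rng = random.Random(1)  # fixed seed so the run is reproducible
p = 0.3
rates = {n: known_pattern_rate(n, p, rng) for n in (100, 10000, 100000)}
for n, rate in rates.items():
    print(n, rate)  # approaches 1 - p = 0.7 as n grows
```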

Let us consider a realistic scenario in which we cannot predict the future events 
and thus the transmitter has no idea of erasure patterns. In this case, we cannot 
apply the above naive transmission scheme because each transmission of a bit is not 
guaranteed to be successful due to the lack of the knowledge of erasure patterns. 
One may imagine that it is impossible to achieve 1 — p. Interestingly, one can still 
achieve 1 — p even in this realistic scenario. 


How to encode? Here is a transmission scheme. Fix $R$ arbitrarily close to $1 - p$.
Given $R$, what is $M$ (the cardinality of the range of the message $W$)? Since
$R := \frac{\log M}{n}$, one needs to set it as: $M = 2^{nR}$. The message $W$ takes one of the values
among $1, 2, \ldots, 2^{nR}$.

Next, how to encode the message $W$? In other words, what is a mapping rule
between $W$ and $X^n$? Here we call $X^n$ a codeword¹. Shannon’s encoding rule, to be
explained in the sequel, enables us to achieve $R = 1 - p$. Shannon's encoding
is simple. The idea is to generate a random binary string given any $W = w$. In
other words, for any $W = w$, every component of $X^n(w)$ follows a binary random
variable with parameter $\frac{1}{2}$ and those are independent of each other, i.e., the $X_i(w)$’s
are i.i.d. ~ Bern($\frac{1}{2}$) across all $i$’s. These are i.i.d. also across all $w$’s. This looks like
a dumb scheme. Surprisingly, Shannon showed that this dumb scheme can achieve
the rate of $1 - p$. We have a terminology indicating a collection (book) of $X_i(w)$’s.
It is called a codebook.
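
Generating such a random codebook takes only a few lines. This is a minimal sketch under assumptions: the function name `random_codebook` and the seed are hypothetical, and $nR$ is assumed to be an integer so that $M = 2^{nR}$ is a whole number of messages.

```python
import random


def random_codebook(n, R, rng):
    """Draw M = 2^(nR) codewords of length n with i.i.d. Bern(1/2) components."""
    M = round(2 ** (n * R))  # assumes nR is an integer
    return {w: tuple(rng.randint(0, 1) for _ in range(n)) for w in range(1, M + 1)}


rng = random.Random(0)  # fixed seed so the run is reproducible
codebook = random_codebook(n=4, R=0.5, rng=rng)
for w, codeword in codebook.items():
    print(w, codeword)
```

Note that two different messages may be assigned the same codeword by chance; the random coding argument accounts for this event through the probability of error.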


How to decode? Let us move on to the decoder side. The decoder input is a
received signal $Y^n$ (channel output). One can make a reasonable assumption on
the decoder: the codebook, a collection of $X_i(w)$’s, is known. This assumption
is realistic because this information needs to be shared only once, at the beginning
of communication. Once this information is shared, we can use it all the time until
communication is terminated.

1. The terminology “codeword” was used to indicate the output of the source encoder in Part I. By convention,
people employ the same terminology for the output of channel encoder.

How to decode $W$ from $Y^n$, assuming the knowledge of the codebook? What
is an optimal way of decoding $W$? Remember the second performance metric: the
probability of error $P_e := P(\hat{W} \neq W)$. Due to the stochastic nature of the channel,
the best we can hope for is to minimize the probability of error. Thus, an optimal
decoder is one that achieves the minimum possible probability of error. Alternatively,
the optimal decoder can be defined as the one that maximizes the success
probability, $P(\hat{W} = W)$. Since the received signal is known (as it is an input to
the decoder), the success probability can be defined as the conditional probability
of correctly decoding the message, given the received signal:

$$P(\hat{W} = W | Y^n = y^n).$$

Then, a more formal definition of the optimal decoder is:

$$\hat{W} = \arg\max_{w} P(W = w | Y^n = y^n).$$

We have a terminology for the optimal decoder. Notice that $P(W = w | Y^n = y^n)$ is
the probability after an observation $y^n$ is made; and $P(W = w)$ is called the a priori
probability because it is the one known beforehand. Hence, $P(W = w | Y^n = y^n)$
is called the a posteriori probability. Observe that the optimal decoder is the one
that Maximizes A Posteriori (MAP) probability. So it is called the MAP decoder.
The MAP decoder acts as an optimal decoder for many interesting problems not
limited to this problem context. As long as a problem of interest is an inference
problem (infer X from Y when X and Y are probabilistically related), the optimal
decoder is always the MAP decoder.

In fact, this MAP decoder can be simplified further in many cases including ours
as a special case. Here is how it is simplified. Using the definition of conditional
probability, we get:

$$P(W = w | Y^n = y^n) = \frac{P(W = w, Y^n = y^n)}{P(Y^n = y^n)} = \frac{P(W = w) P(Y^n = y^n | W = w)}{P(Y^n = y^n)}.$$

Notice that $P(W = w) = \frac{1}{2^{nR}}$ because $W$ is uniformly distributed, so it does not
depend on $w$. Also $P(Y^n = y^n)$ is not a
function of $w$. This implies that it suffices to consider only $P(Y^n = y^n | W = w)$
in figuring out when the above probability is maximized. Hence, we obtain:

$$\hat{W} = \arg\max_{w} P(W = w | Y^n = y^n) = \arg\max_{w} P(Y^n = y^n | W = w).$$

Given $W = w$, $X^n$ is known as $x^n(w)$. Hence,

$$\hat{W} = \arg\max_{w} P(Y^n = y^n | W = w) = \arg\max_{w} P(Y^n = y^n | X^n = x^n(w), W = w).$$

Also given $X^n = x^n(w)$, $Y^n$ is a sole function of the channel, meaning that $Y^n$ is
conditionally independent of $W$. Hence,

$$\hat{W} = \arg\max_{w} P(Y^n = y^n | X^n = x^n(w), W = w) = \arg\max_{w} P(Y^n = y^n | X^n = x^n(w)).$$

Notice that $P(Y^n = y^n | X^n = x^n(w))$ is nothing but the conditional distribution,
which is easy to compute. There is another name for the conditional distribution:
likelihood. So the decoder is called the Maximum Likelihood (ML) decoder.

How to derive the ML decoder? The ML decoder is simple in our problem
context. Let us see this through a simple example where $n = 4$ and $R = \frac{1}{2}$. In this
example, $M = 2^{nR} = 4$. Consider a particular codebook:

$$
X^n(1) = 0000; \quad X^n(2) = 0110; \quad X^n(3) = 1010; \quad X^n(4) = 1111.
$$

Suppose $y^n = 0e00$. Then, one can compute the likelihood $P(y^n \mid x^n(w))$. For
instance,

$$
P(y^n \mid x^n(1)) = (1 - p)^3 p.
$$

The channel is perfect three times (times 1, 3 and 4), while the symbol is erased at time 2.
On the other hand,

$$
P(y^n \mid x^n(2)) = 0
$$

because the third bit 1 does not match $y_3 = 0$, and this is an event that would
never happen. In other words, the second message is incompatible with the received
signal $y^n$. Then, what is the ML decoding rule? Here is what the rule says.


1. Eliminate all the messages which are incompatible with the received signal. 
2. If there is only one message that survives, declare the survival as the correct 
message that the transmitter sent. 


However, this procedure is not sufficient to describe the ML decoding rule. The 
reason is that we may have a different erasure pattern that confuses the rule. To 
see this clearly, consider the following example. Suppose the received signal is now 
$y^n = (0ee0)$. Then,

$$
P(y^n \mid x^n(1)) = (1 - p)^2 p^2; \qquad P(y^n \mid x^n(2)) = (1 - p)^2 p^2.
$$


The two messages (1 and 2) are compatible and those likelihood functions are equal. 
In such cases, the best we can do is to randomly choose one out of the two options. 
This forms the ML decoding rule. 


3. If there are multiple survivals, choose one randomly. 
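The three-step rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`is_compatible`, `bec_likelihood`, `ml_decode_bec`) and the string encoding with `"e"` for an erasure are choices made here for the example. Since every compatible codeword has the same likelihood $p^{\#\text{erasures}}(1-p)^{n-\#\text{erasures}}$, picking uniformly among the survivors is an ML decision.

```python
import random

ERASURE = "e"  # assumed symbol for an erased position

def is_compatible(codeword, received):
    """True if the codeword agrees with the received signal at every non-erased position."""
    return all(r == ERASURE or r == c for c, r in zip(codeword, received))

def bec_likelihood(codeword, received, p):
    """P(y^n | x^n) over a memoryless BEC with erasure probability p."""
    if not is_compatible(codeword, received):
        return 0.0
    erased = received.count(ERASURE)
    return (p ** erased) * ((1 - p) ** (len(received) - erased))

def ml_decode_bec(codebook, received, rng=random):
    """Return (decoded message, surviving messages); messages are 1-indexed."""
    # Step 1: eliminate incompatible messages.
    survivors = [w for w, x in enumerate(codebook, start=1) if is_compatible(x, received)]
    if not survivors:
        raise ValueError("no codeword is compatible with the received signal")
    # Steps 2 and 3: a single survivor is declared; ties are broken at random.
    return rng.choice(survivors), survivors

codebook = ["0000", "0110", "1010", "1111"]  # the example codebook from the text
print(ml_decode_bec(codebook, "0e00"))       # only message 1 survives
print(ml_decode_bec(codebook, "0ee0")[1])    # messages 1 and 2 both survive
```

With the second received signal, the decoder returns message 1 or message 2 with equal probability, which is exactly the ambiguity that rule 3 resolves.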


Look ahead In the next section, we will demonstrate that we can achieve the rate
of $1 - p$ under the optimal ML decoding rule.

2.3 Achievability Proof for the Binary Symmetric
Channel


Recap In the prior section, we claimed that the capacity of the BEC with erasure
probability $p$ is

$$
C_{\mathrm{BEC}} = 1 - p.
$$

We then attempted to prove the achievability: if $R < 1 - p$, we can make $P_e$ arbitrarily
close to 0 as $n \to \infty$. We employed a random codebook where each component
$X_i(w)$ of the codebook follows $\mathrm{Bern}(\frac{1}{2})$ and is i.i.d. across all $i \in \{1, \ldots, n\}$ and
$w \in \{1, \ldots, 2^{nR}\}$. We also employed the optimal decoder: the maximum likelihood
(ML) decoder in our problem context where the message $W$ is uniformly
distributed. Next, we intended to complete the achievability proof, by showing
$P_e \to 0$ under the problem setup.


Outline In this section, we will finish the proof and move on to another chan- 
nel example that provides additional insights into the general channel setting. The 
channel we will focus on is called the Binary Symmetric Channel (BSC). 


Probability of error The probability of error is a function of the codebook $\mathcal{C}$. So
we are going to investigate the average error probability $\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})]$ taken over all
possible random codebooks, and will demonstrate that $\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})]$ approaches 0 as
$n \to \infty$. This then implies the existence of an optimal deterministic codebook,
say $\mathcal{C}^*$, such that $P_e(\mathcal{C}^*) \to 0$ as $n \to \infty$. Why? See Prob 4.3 for the proof of the
existence. Consider:

$$
\begin{aligned}
\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] &\overset{(a)}{=} \sum_{c} P(\mathcal{C} = c)\, P(\hat{W} \neq W \mid \mathcal{C} = c) \\
&\overset{(b)}{=} P(\hat{W} \neq W) \\
&\overset{(c)}{=} \sum_{w=1}^{2^{nR}} P(W = w)\, P(\hat{W} \neq w \mid W = w)
\end{aligned}
\tag{2.3}
$$

where (a) follows from the fact that $P(\mathcal{C} = c)$ indicates the probability that the
codebook is chosen as a particular realization $c$; (b) and (c) are due to the total
probability law.

Now consider $P(\hat{W} \neq w \mid W = w)$. Notice that $[X_1(w), \ldots, X_n(w)]$'s are
identically distributed over all $w$'s. The way that the $w$th codeword is constructed is the
same for all $w$'s. This implies that $P(\hat{W} \neq w \mid W = w)$ does not depend on the
particular value of $w$, meaning that

$$
P(\hat{W} \neq w \mid W = w) \text{ is the same for all } w.
$$


This together with (2.3) gives:

$$
\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] = P(\hat{W} \neq 1 \mid W = 1) = P\left( \bigcup_{w=2}^{2^{nR}} \{\hat{W} = w\} \,\Big|\, W = 1 \right)
\tag{2.4}
$$

where the second equality is due to the fact that $\{\hat{W} \neq 1\}$ means that $\hat{W} = w$ for
some $w \neq 1$. In general, the probability of the union of multiple events is not that
simple to compute. Rather it is quite complicated, especially when it involves a large
number of events. Even for the three-event case $(A, B, C)$, the probability
formula is not simple:

$$
\begin{aligned}
P(A \cup B \cup C) = {} & P(A) + P(B) + P(C) \\
& - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).
\end{aligned}
$$


Even worse, the number of associated events in (2.4) is $2^{nR} - 1$ and this
will make the probability formula very complicated. So the calculation of that probability
is involved. To make some progress, Shannon took an indirect approach as
he did in the proof of the source coding theorem. He did not attempt to compute
the probability of error exactly. Instead, he intended to approximate it. He tried to
derive an upper bound because what we want to show at the end of the day is that the
probability of error approaches 0. Note that if an upper bound goes to zero, the
exact quantity also approaches 0. We have a very well-known upper bound w.r.t.
the union of multiple events. That is, the union bound: for events $A$ and $B$,

$$
P(A \cup B) \leq P(A) + P(B).
$$

The proof is immediate: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ and $P(A \cap B) \geq 0$.
Now applying the union bound to (2.4), we get:

$$
\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \leq \sum_{w=2}^{2^{nR}} P(\hat{W} = w \mid W = 1) = (2^{nR} - 1)\, P(\hat{W} = 2 \mid W = 1)
$$


where the equality follows from the fact that the codebook employed is symmetric
w.r.t. message indices. The event $\hat{W} = 2$ implies that message 2 is compatible;
otherwise, $\hat{W}$ cannot be chosen as 2. Using the fact that for two events $A$ and $B$
such that $A$ implies $B$, $P(A) \leq P(B)$, we get:

$$
\begin{aligned}
\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] &\leq (2^{nR} - 1)\, P(\text{message 2 is compatible} \mid W = 1) \\
&\leq 2^{nR}\, P(\text{message 2 is compatible} \mid W = 1).
\end{aligned}
$$

Message 2 being compatible implies that for every time $i$ at which the transmitted signal
is not erased, $X_i(2)$ must be the same as $X_i(1)$; otherwise, message 2 cannot be
compatible. To see this clearly, let $\mathcal{B} = \{i : Y_i \neq e\}$. Then, what the above means
is that we get:

$$
\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \leq 2^{nR}\, P\left( \bigcap_{i \in \mathcal{B}} \{X_i(1) = X_i(2)\} \,\Big|\, W = 1 \right) = 2^{nR} \left( \frac{1}{2} \right)^{|\mathcal{B}|}
\tag{2.5}
$$

where the equality is due to the fact that $P(X_i(1) = X_i(2)) = \frac{1}{2}$ and the codebook
is independent across all $i$'s.
Observe that due to the WLLN, for sufficiently large $n$:

$$
|\mathcal{B}| \approx n(1 - p).
$$

This together with (2.5) yields:

$$
\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] \lesssim 2^{n(R - (1 - p))}.
$$

Hence, $\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})]$ can be made arbitrarily close to 0 as $n \to \infty$, as long as
$R < 1 - p$. This completes the achievability proof.
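To get a feel for how fast the bound decays, the short sketch below evaluates the approximate upper bound $2^{n(R - (1 - p))}$ for a few code lengths. The function name `bec_error_bound` and the numerical values of $p$ and $R$ are illustrative assumptions, not part of the proof.

```python
def bec_error_bound(n, rate, p):
    """Approximate upper bound 2^{n(R - (1 - p))} on the average error probability."""
    return 2.0 ** (n * (rate - (1.0 - p)))

p = 0.3  # erasure probability (illustrative value), so 1 - p = 0.7
for n in (10, 100, 1000):
    below = bec_error_bound(n, rate=0.6, p=p)  # R < 1 - p: the bound decays with n
    above = bec_error_bound(n, rate=0.8, p=p)  # R > 1 - p: the bound exceeds 1
    print(n, below, above)
```

For $R < 1 - p$ the bound shrinks exponentially in $n$. For $R > 1 - p$ it grows beyond 1 and therefore says nothing about the error probability, which is why the argument only establishes achievability below $1 - p$.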


Binary symmetric channel (BSC) Before moving on to arbitrary channels,
let us delve deeper into the Binary Symmetric Channel (BSC) to gain more understanding.
While the techniques used in proving achievability for the Binary Erasure
Channel (BEC) are not sufficient for generalization, the techniques we will learn
in the BSC achievability proof offer valuable insights that can be extended to other
channels.

In the BSC, the channel output $Y_i$ is a flipped version of $X_i$ with probability, say
$p$ (called the crossover probability): $p(y|x) = p$ when $y \neq x$. So the relation between
$X_i$ and $Y_i$ is given by:

$$
Y_i = \begin{cases} X_i, & \text{w.p. } 1 - p; \\ X_i \oplus 1, & \text{w.p. } p. \end{cases}
$$

Another simpler way to represent this is:

$$
Y_i = X_i \oplus Z_i
$$

where $Z_i$'s are i.i.d. $\sim \mathrm{Bern}(p)$. Without loss of generality, assume $p \in (0, \frac{1}{2})$;
otherwise, we can flip all 0's and 1's to 1's and 0's, respectively.
Let's start by claiming the channel capacity:

$$
C_{\mathrm{BSC}} = 1 - H(p)
$$

where $H(p) := p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}$. For the rest of this section, we will prove
the achievability: if $R < 1 - H(p)$, one can make the probability of error arbitrarily
close to 0 as $n \to \infty$.
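As a quick numerical companion to the two capacity formulas, the following sketch evaluates $H(p)$ and the claimed capacities. It assumes logarithms are base 2 (capacity in bits per channel use); the function names are chosen here for illustration.

```python
import math

def binary_entropy(p):
    """H(p) = p log2(1/p) + (1 - p) log2(1/(1 - p)), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Claimed BSC capacity 1 - H(p) for crossover probability p."""
    return 1.0 - binary_entropy(p)

def bec_capacity(p):
    """Claimed BEC capacity 1 - p for erasure probability p."""
    return 1.0 - p

for p in (0.01, 0.11, 0.25, 0.5):
    print(f"p={p}: C_BSC={bsc_capacity(p):.4f}, C_BEC={bec_capacity(p):.4f}")
```

For the same parameter $p < \frac{1}{2}$, a flip costs more capacity than an erasure: an erasure tells the receiver where the information was lost, whereas a flip does not.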


Encoder & decoder The encoder that we will employ is the same as before: the
random code where $X_i(w)$'s are i.i.d. $\sim \mathrm{Bern}(\frac{1}{2})$ across all $i$'s and $w$'s. We will also
use the optimal decoder, which is the ML decoder:

$$
\hat{W} = \arg\max_{w} P(y^n \mid x^n(w)).
$$

The way that we compute the likelihood function is different from that in the
BEC case. Let us see this difference through the following example. Suppose that
$(n, R) = (4, \frac{1}{2})$ and the codebook is:

$$
X^n(1) = 0000; \quad X^n(2) = 1110; \quad X^n(3) = 1010; \quad X^n(4) = 1111.
$$

Suppose that the received signal $y^n$ is $(0100)$. Then, the likelihood functions are:

$$
\begin{aligned}
P(y^n \mid x^n(1)) &= (1 - p)^3 p^1; \\
P(y^n \mid x^n(2)) &= (1 - p)^2 p^2; \\
P(y^n \mid x^n(3)) &= (1 - p)^1 p^3; \\
P(y^n \mid x^n(4)) &= (1 - p)^1 p^3.
\end{aligned}
$$

Unlike the BEC case, all the messages are compatible because all of the likelihood
functions are strictly positive. So we need to compare all the functions to choose
the message that maximizes the function. Since we assume $p \in (0, \frac{1}{2})$, in the above
example, the first message is the maximizer. It has the minimum number of flips
and the flipping probability is smaller than $\frac{1}{2}$, and hence, the corresponding
likelihood function is maximized.

This reveals that the decision is heavily dependent on the number of non-matching
bits (flips):

$$
d(x^n, y^n) := |\{i : x_i \neq y_i\}|.
$$

This is called the Hamming distance (Hamming, 1950). Using this, the ML decoder
can be re-written as:

$$
\hat{W} = \arg\min_{w} d(x^n(w), y^n).
$$
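The equivalence between maximizing the likelihood and minimizing the Hamming distance can be checked on the example codebook. The sketch below is illustrative only: the function names are assumptions, and ties are broken by the smallest message index, which is an arbitrary choice made here.

```python
def hamming_distance(x, y):
    """Number of positions where the equal-length sequences x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

def bsc_likelihood(codeword, received, p):
    """P(y^n | x^n) = p^d (1 - p)^(n - d) over a memoryless BSC, d = Hamming distance."""
    d = hamming_distance(codeword, received)
    return (p ** d) * ((1 - p) ** (len(received) - d))

def ml_decode_bsc(codebook, received):
    """Minimum-distance decoding for p < 1/2; returns (message, all distances)."""
    distances = [hamming_distance(x, received) for x in codebook]
    return distances.index(min(distances)) + 1, distances

codebook = ["0000", "1110", "1010", "1111"]  # the example codebook from the text
print(ml_decode_bsc(codebook, "0100"))
print([bsc_likelihood(x, "0100", 0.1) for x in codebook])
```

For any $p < \frac{1}{2}$ the likelihood $p^d (1-p)^{n-d}$ is strictly decreasing in $d$, so the codeword with the fewest flips is the likelihood maximizer.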


Our remaining task is to demonstrate that the probability of error can be made arbi- 
trarily close to zero as n approaches infinity when utilizing the ML decoder. How- 
ever, we will not use the ML decoder for two reasons. Firstly, the analysis of error 
probability is somewhat intricate. Secondly, the proof method cannot be applied to 
arbitrary channels. How can we then prove the achievability without employing the 
ML decoder? Fortunately, there exists an alternative but suboptimal decoder that 
simplifies the proof of achievability significantly while still achieving 1 — H(p). 
Additionally, the suboptimal decoder is generalizable, which makes it suitable for 
proving the achievability. Therefore, we will utilize the suboptimal decoder to prove 


the achievability. 


A suboptimal decoder The suboptimal decoder that we will employ is inspired
by the following observation. Notice that the input to the decoder is $Y^n$ and the codebook
is available at the decoder, i.e., $X^n(w)$ is known for all $w$'s. Now observe that

$$
Y^n \oplus X^n = Z^n
$$

and we know the statistics of $Z^n$: i.i.d. $\sim \mathrm{Bern}(p)$.
Suppose that the actually transmitted signal is $X^n(1)$. Then,

$$
Y^n \oplus X^n(1) = Z^n \sim \mathrm{Bern}(p).
$$

On the other hand, for $w \neq 1$,

$$
Y^n \oplus X^n(w) = X^n(1) \oplus X^n(w) \oplus Z^n.
$$

The resulting sequence is i.i.d. $\sim \mathrm{Bern}(\frac{1}{2})$. Notice that the modulo-2 sum of any two
independent Bernoulli random variables is $\mathrm{Bern}(\frac{1}{2})$, as long as at least one of them
follows $\mathrm{Bern}(\frac{1}{2})$. Why? This motivates us to employ the following decoding rule.


1. Compute $Y^n \oplus X^n(w)$ for all $w$'s.
2. Eliminate messages such that the resulting sequence is not typical w.r.t.
$\mathrm{Bern}(p)$. More precisely, let

$$
A_\epsilon^{(n)} := \{z^n : 2^{-n(H(p) + \epsilon)} \leq p(z^n) \leq 2^{-n(H(p) - \epsilon)}\}
$$

be a typical set w.r.t. $\mathrm{Bern}(p)$. Eliminate all $w$'s such that $Y^n \oplus X^n(w) \notin A_\epsilon^{(n)}$.
3. If there is only one message that survives, then declare the survival as the
correct message. Otherwise, declare an error.


The error event is of two types. The first is the case where there are multiple
survivals. The second refers to the scenario where there is no survival.
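The typicality-based rule can be sketched as follows. This is a toy illustration under assumptions made here: sequences are lists of 0/1 integers, $0 < p < 1$, logarithms are base 2, and the names `is_typical` and `typicality_decode` are hypothetical. The typicality test is written in the equivalent form $\left| -\frac{1}{n} \log p(z^n) - H(p) \right| \leq \epsilon$.

```python
import math

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def is_typical(z, p, eps):
    """Check 2^{-n(H(p)+eps)} <= p(z^n) <= 2^{-n(H(p)-eps)} for z^n i.i.d. Bern(p)."""
    n, k = len(z), sum(z)
    # -(1/n) log2 p(z^n), where p(z^n) = p^k (1 - p)^(n - k)
    sample_entropy = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
    return abs(sample_entropy - binary_entropy(p)) <= eps

def typicality_decode(codebook, y, p, eps):
    """Return the unique surviving message (1-indexed), or None to signal an error."""
    survivors = []
    for w, x in enumerate(codebook, start=1):
        z = [yi ^ xi for yi, xi in zip(y, x)]  # step 1: candidate noise Y^n xor X^n(w)
        if is_typical(z, p, eps):              # step 2: keep only typical candidates
            survivors.append(w)
    return survivors[0] if len(survivors) == 1 else None  # step 3

codebook = [[0, 0, 0, 0], [1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1]]
print(typicality_decode(codebook, [0, 1, 0, 0], p=0.25, eps=0.1))  # one survivor
print(typicality_decode(codebook, [0, 0, 0, 0], p=0.25, eps=0.1))  # no survivor
print(typicality_decode(codebook, [1, 0, 1, 1], p=0.25, eps=0.1))  # two survivors
```

With $n = 4$ and $p = 0.25$, only noise patterns with exactly one flip are typical for this $\epsilon$, so even an error-free reception (second call) is rejected. At such a short length the rule is clearly suboptimal; its value lies in the analysis for large $n$.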


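Probability of error We are ready to analyze the probability of error to complete
the achievability proof. For notational simplicity, let $P_e := \mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})]$. Using the
same argument that we made in the BEC case, we get:

$$
P_e := \mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] = P(\hat{W} \neq 1 \mid W = 1).
$$

As mentioned earlier, the error event is of the two types: (i) multiple survivals; and
(ii) no survival. The multiple-survival event implies that there exists $w \neq 1$ such
that $Y^n \oplus X^n(w) \in A_\epsilon^{(n)}$. The no-survival event implies that even $Y^n \oplus X^n(1)$ w.r.t.
the correct message "1" is not a typical sequence, meaning that $Z^n \notin A_\epsilon^{(n)}$. Hence,
we get:

$$
\begin{aligned}
P_e &= P(\hat{W} \neq 1 \mid W = 1) \\
&\leq P\left( \bigcup_{w \neq 1} \{Y^n \oplus X^n(w) \in A_\epsilon^{(n)}\} \cup \{Z^n \notin A_\epsilon^{(n)}\} \,\Big|\, W = 1 \right) \\
&\overset{(a)}{\leq} \sum_{w=2}^{2^{nR}} P(Y^n \oplus X^n(w) \in A_\epsilon^{(n)} \mid W = 1) + P(Z^n \notin A_\epsilon^{(n)} \mid W = 1) \\
&\overset{(b)}{=} (2^{nR} - 1)\, P(Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1) + P(Z^n \notin A_\epsilon^{(n)} \mid W = 1) \\
&\overset{(c)}{\approx} (2^{nR} - 1)\, P(Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1)
\end{aligned}
\tag{2.6}
$$

where (a) follows from the union bound; (b) is by symmetry of the codebook; and
(c) follows from the fact that $Z^n$ is a typical sequence w.h.p. due to the WLLN.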


Observe that

$$
Y^n \oplus X^n(2) = X^n(1) \oplus X^n(2) \oplus Z^n \sim \mathrm{Bern}\left(\tfrac{1}{2}\right).
$$

How to compute $P(Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1)$? To this end, we need to consider
two quantities: (i) the total number of possible sequences that $Y^n \oplus X^n(2) \sim \mathrm{Bern}(\frac{1}{2})$
can take on; and (ii) the size of the typical set $|A_\epsilon^{(n)}|$. Specifically, we have:

$$
P(Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1) = \frac{|A_\epsilon^{(n)}|}{\text{total number of } \mathrm{Bern}(\frac{1}{2}) \text{ sequences}}.
$$

Note that the total number of $\mathrm{Bern}(\frac{1}{2})$ sequences is $2^n$ and $|A_\epsilon^{(n)}| \leq 2^{n(H(p) + \epsilon)}$
(Why?). Hence,

$$
P(Y^n \oplus X^n(2) \in A_\epsilon^{(n)} \mid W = 1) \leq \frac{2^{n(H(p) + \epsilon)}}{2^n}.
$$

Combining this with (2.6), we get:

$$
P_e \lesssim 2^{n(R - (1 - H(p) - \epsilon))}.
$$

Here $\epsilon$ can be made arbitrarily close to 0. Hence, if $R < 1 - H(p)$, $P_e \to 0$ as
$n \to \infty$. This completes the proof.
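Two facts in this section were left as "Why?". One possible way to verify them, using only the definitions above, is sketched below. The symbols $U$, $V$ and $q$ are introduced here purely for illustration.

```latex
% Fact 1: if U ~ Bern(1/2) is independent of V ~ Bern(q) for any q in [0,1],
% then U xor V ~ Bern(1/2).
\begin{aligned}
P(U \oplus V = 1) &= P(U = 1)\,P(V = 0) + P(U = 0)\,P(V = 1) \\
                  &= \tfrac{1}{2}(1 - q) + \tfrac{1}{2}\,q = \tfrac{1}{2}.
\end{aligned}

% Fact 2: every z^n in the typical set has p(z^n) >= 2^{-n(H(p)+\epsilon)},
% and probabilities sum to at most 1, so
\begin{aligned}
1 \;\geq\; \sum_{z^n \in A_\epsilon^{(n)}} p(z^n)
  \;\geq\; |A_\epsilon^{(n)}| \cdot 2^{-n(H(p)+\epsilon)}
  \quad\Longrightarrow\quad
  |A_\epsilon^{(n)}| \;\leq\; 2^{n(H(p)+\epsilon)}.
\end{aligned}
```

Fact 1 is applied componentwise with $U = X_i(2)$ and $V = X_i(1) \oplus Z_i$, which are independent because the codewords are generated independently of each other and of the noise.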


Look ahead Having established the achievability proof for the BSC, our next
section will focus on expanding the proof of achievability to cover the general scenario
in which the channel is characterized by an arbitrary conditional distribution
$p(y|x)$.

Problem Set 4


Prob 4.1 (Concept of channel capacity) We wish to send a uniformly distributed
message $W \in \{1, 2, \ldots, M\}$ from a transmitter to a receiver. Let $f$ and $g$
be the channel encoder and decoder, respectively. Let $R = \frac{\log M}{n}$ be the data rate and
$P_e := P(\hat{W} \neq W)$ be the probability of error. Here $n$ denotes the code length and
$\hat{W}$ indicates a decoded message. In an attempt to understand the fundamental
tradeoff between the data rate $R$ and the error probability $P_e$, Shannon considered
the following optimization problem. Given $R$ and $n$,

$$
P_e^*(R, n) := \min_{f, g} P_e. \tag{2.7}
$$


(a) Could Shannon solve the optimization problem (2.7)? 

(b) State the definition of possible (reliable) communication. 

(c) State the definition of an achievable rate. 

(d) State the definition of channel capacity using the concept of an achievable 
rate. 

(e) Consider a slightly different optimization problem: Given $R$,

$$
P_e^*(R) := \min_{f, g, n} P_e.
$$

Now the design variables that we optimize over include the code length $n$. Let $C$
be the channel capacity. For any $\epsilon > 0$, what are $P_e^*(C - \epsilon)$ and $P_e^*(C + \epsilon)$?


Prob 4.2 (Optimal decoding principle) Consider a binary random variable
$X \sim \mathrm{Bern}(m)$. The signal $X$ is transmitted over a binary symmetric channel with
crossover probability $p$, yielding a channel output $Y$. Given $Y = y$, the receiver
wishes to decode $X$ so as to minimize the probability of error $P_e := P(X \neq \hat{X})$.
Here $\hat{X}$ denotes a decoder output. This problem explores the optimal way of
decoding $X$.


(a) Compute the a posteriori probability: $P(X = 1 \mid Y = y)$.

(b) Derive the optimal decoder, i.e., derive $\hat{X}$ that yields the minimum $P_e$.

(c) The optimal decoder derived in part (b) is called the Maximum A Posteriori 
probability (MAP) decoder. Explain the rationale behind the naming. 


Prob 4.3 (Existence of an optimal deterministic code) Consider a memoryless
binary erasure channel with erasure probability $p$. In Section 2.3, we claimed
the achievability of the data rate $1 - p$ when employing a random code. Given
$R < 1 - p$, $\mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})]$ can be made arbitrarily close to 0 as $n \to \infty$. Here $P_e(\mathcal{C})$
indicates the probability of error when using a codebook $\mathcal{C}$ and the ML decoding,
and the expectation is over a random choice of the codebook. In this problem, you
are asked to show that the above statement implies the existence of an optimal
deterministic code, say $\mathcal{C}^*$: $P_e(\mathcal{C}^*) \to 0$ as $n \to \infty$.


(a) Suppose $X$ is a real-valued discrete random variable. Argue that there exists
a value, say $a$, such that $P(X = a) \neq 0$ and $a \leq \mathbb{E}[X]$.
(b) Using part (a) or otherwise, show that the above statement for the achievability
implies the existence of a deterministic code $\mathcal{C}^*$ such that given
$R < 1 - p$, $P_e(\mathcal{C}^*) \to 0$ as $n \to \infty$.


Prob 4.4 (Chernoff bound) 


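(a) Show that for a random variable $X$ and some constant $a$,

$$
P(X > a) \leq \min_{\lambda > 0} \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda a}}.
$$

(b) Suppose $X_1, \ldots, X_n$ are i.i.d., each being distributed according to $\mathrm{Bern}(m)$.
Fix $\delta > 0$. Show that

$$
P(X_1 + \cdots + X_n > n(m + \delta)) \leq 2^{-n\,\mathrm{KL}(m + \delta \| m)}
$$

where $\mathrm{KL}(m + \delta \| m)$ denotes the KL divergence between $\mathrm{Bern}(m + \delta)$ and
$\mathrm{Bern}(m)$.
(c) Consider the same setup as that in part (b). Show that

$$
P(X_1 + \cdots + X_n < n(m - \delta)) \leq 2^{-n\,\mathrm{KL}(m - \delta \| m)}.
$$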


Prob 4.5 (Channel modeling & channel capacity) Consider a memoryless 
channel which concatenates the following two channels serially: (i) a binary sym- 
metric channel with crossover probability p; and (ii) a binary erasure channel with 
erasure probability $\epsilon$. Let $X$ and $Y$ be the input and the output of the channel.


(a) Derive the conditional distribution $p(y|x)$.
(b) Compute $\max_{p(x)} I(X; Y)$.


Prob 4.6 (Maximum Likelihood decoding) The achievability proof for BSC 
that we learned in Section 2.3 uses joint typicality decoding. As mentioned in the 
section, the joint typicality decoding is not optimal in terms of minimizing the 
probability of error. In this problem, you are asked to prove the achievability when 
using the optimal decoder: maximum likelihood decoder (MLD). 

We consider a BSC with crossover probability $p < \frac{1}{2}$. Define the Hamming distance
$d(x^n, y^n)$ between two binary sequences $x^n$ and $y^n$ as the number of positions
where they differ, i.e., $d(x^n, y^n) = |\{i : x_i \neq y_i\}|$.

(a) Show that the MLD rule reduces to the minimum Hamming distance
decoding rule: declare $\hat{w}$ is sent if $d(x^n(\hat{w}), y^n) < d(x^n(w), y^n)$ for all
$w \neq \hat{w}$.

(b) Using the random code (that we learned in Section 2.2) and the minimum
distance decoder, show that for every $\epsilon > 0$, the probability of error is upper
bounded as

$$
\begin{aligned}
P_e &= P\{\hat{W} \neq 1 \mid W = 1\} \\
&\leq P\{d(X^n(1), Y^n) > n(p + \epsilon) \mid W = 1\} \\
&\quad + (2^{nR} - 1)\, P\{d(X^n(2), Y^n) \leq n(p + \epsilon) \mid W = 1\}.
\end{aligned}
$$

(c) Show that the first term in the RHS in the above inequality tends to zero
as $n \to \infty$. Using the Chernoff bound, show that

$$
P\{d(X^n(2), Y^n) \leq n(p + \epsilon) \mid W = 1\} \leq 2^{-n(1 - H(p + \epsilon))}.
$$

Using these results, show that any $R < C = 1 - H(p)$ is achievable.


Prob 4.7 (Maximum Likelihood decoding) We wish to transmit a message
$W \in \{1, \ldots, M\}$ over a binary erasure channel with erasure probability $p$. Suppose
we employ the random encoding that we learned in Section 2.2.


(a) State the MAP decoding rule.

(b) Show that the MAP rule is optimal in the sense of minimizing the probability
of error.

(c) Show that the MAP rule reduces to the ML rule when the message $W$ is uniformly
distributed.

(d) A student claims that given $W = 1$, the following events are disjoint:
$\{\hat{W} = 2\}, \ldots, \{\hat{W} = M\}$. Prove or disprove it. Here $\hat{W}$ indicates the output of
the ML decoder.


Prob 4.8 (Random vs deterministic codes) Consider a channel coding
problem setup in which the code length $n = 4$ and the data rate $R = \frac{1}{2}$. Let
$W \in \{1, \ldots, 2^{nR}\}$ and $X^n$ be the message and codeword, respectively. Suppose that codeword
$X^n$ is transmitted over a BEC with erasure probability $p \in [0, 1]$, thus yielding
a received signal $Y^n$. Assume we use an optimal decoder (i.e., the ML decoder in
this problem setup). Let $\hat{W}$ be the decoded message.


(a) Suppose we employ the random code in which components of codewords
$X_i(w)$'s are i.i.d. $\sim \mathrm{Bern}(\frac{1}{2})$ across both $w \in \{1, \ldots, 2^{nR}\}$ and
$i \in \{1, \ldots, n\}$. Show that $P(\hat{W} = i \mid W = 1)$ is the same for all $i \neq 1$.




(b) Now we use a deterministic code, say C, instead: 


X”(1) = 0000; 
X”(2) = 1100; 
X”(3) = 1110; 
X”(4) = 1111. 


A student claims that $P(\hat{W} = i \mid W = 1, \mathcal{C})$ is still the same for all $i \neq 1$.
Prove or disprove this claim.


Prob 4.9 (Basic bounds) Let $A$ and $B$ be events.


(a) Show that $P(A \cup B) \leq P(A) + P(B)$.
(b) Suppose that $A$ implies $B$, i.e., whenever $A$ occurs, $B$ also occurs. Show that
$P(A) \leq P(B)$.


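Prob 4.10 (A chance of a random sequence being typical) Let $\{X_i\}$ be an
i.i.d. binary random process, each being distributed according to $\mathrm{Bern}(\frac{1}{2})$. Define a set

$$
A_\epsilon^{(n)} := \{z^n : 2^{-n(H(p) + \epsilon)} \leq p(z^n) \leq 2^{-n(H(p) - \epsilon)}\}
$$

where $H(p) = p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}$ and $p \in (0, 0.5)$.
(a) Show that

$$
P(X^n \in A_\epsilon^{(n)}) = \frac{|A_\epsilon^{(n)}|}{2^n}.
$$

(b) Show that

$$
P(X^n \in A_\epsilon^{(n)}) \leq 2^{-n(1 - H(p) - \epsilon)}.
$$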


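Prob 4.11 (Concept of jointly typical sequences) Consider an i.i.d. sequence
pair $(X^n, Y^n)$ with $p(x, y)$ where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Here the i.i.d. sequence pair
means that $(X_i, Y_i)$'s are i.i.d. over $i$ and each follows the identical distribution
$p(x, y)$. Fix $\epsilon > 0$. Let

$$
\begin{aligned}
A_\epsilon^{(n)}(X) &= \{x^n : 2^{-n(H(X) + \epsilon)} \leq p(x^n) \leq 2^{-n(H(X) - \epsilon)}\}; \\
A_\epsilon^{(n)}(Y) &= \{y^n : 2^{-n(H(Y) + \epsilon)} \leq p(y^n) \leq 2^{-n(H(Y) - \epsilon)}\}; \\
A_\epsilon^{(n)}(X, Y) &= \{(x^n, y^n) : 2^{-n(H(X, Y) + \epsilon)} \leq p(x^n, y^n) \leq 2^{-n(H(X, Y) - \epsilon)}\}.
\end{aligned}
$$

Show that for any $\epsilon > 0$,

$$
\begin{aligned}
P(X^n \in A_\epsilon^{(n)}(X)) &\to 1, \\
P(Y^n \in A_\epsilon^{(n)}(Y)) &\to 1, \\
P((X^n, Y^n) \in A_\epsilon^{(n)}(X, Y)) &\to 1,
\end{aligned}
$$

as $n \to \infty$.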


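Prob 4.12 (Concept of jointly typical sequences) Consider an i.i.d.
sequence pair $(X^n, Y^n)$ with $p(x, y)$ where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Fix $\epsilon > 0$. Let

$$
\begin{aligned}
A_\epsilon^{(n)}(X) &= \{x^n : 2^{-n(H(X) + \epsilon)} \leq p(x^n) \leq 2^{-n(H(X) - \epsilon)}\}; \\
A_\epsilon^{(n)}(Y) &= \{y^n : 2^{-n(H(Y) + \epsilon)} \leq p(y^n) \leq 2^{-n(H(Y) - \epsilon)}\}; \\
A_\epsilon^{(n)}(X, Y) &= \{(x^n, y^n) : 2^{-n(H(X, Y) + \epsilon)} \leq p(x^n, y^n) \leq 2^{-n(H(X, Y) - \epsilon)}\}.
\end{aligned}
$$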


A student claims that there exists a sequence pair of (X”, Y”) such that as n —> 00, 


P(X” e A®(X)) > 1; 


P(Y” e AP (Y) > 1; 
P(X”, ¥") e AP (X, Y)) > 1. 


Prove or disprove this statement. 


Prob 4.13 (Sum of Bernoulli random variables) Suppose that $X_1 \sim \mathrm{Bern}(p)$
is independent of $X_2 \sim \mathrm{Bern}(\frac{1}{2})$. What is the statistics of $X_1 \oplus X_2$?


Prob 4.14 (True or False?) 


(a) Consider a memoryless binary erasure channel. Let $X_i$ and $Y_i$ be the input
and the output of the channel at time $i$, respectively. Then,

$$
H(Y_1, Y_2 \mid X_1, X_2) = H(Y_1 \mid X_1) + H(Y_2 \mid X_2).
$$

(b) Let $\{X_i\}$ be a binary random process such that
$P(X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = \frac{1}{2^n}$ for all possible sequence patterns $(i_1, i_2, \ldots, i_n)$.
Then, $\{X_i\}$'s are identically distributed, but not necessarily independent.

(c) Shannon's landmark paper published in 1948 provides an explicit
guideline as to how to design an optimal communication system.

(d) Consider a binary asymmetric channel in which the output $Y$ is a flipped
version of $X$ with probability $p_1 \in [0, 1]$ when $X = 0$ (and with probability
$p_2 \in [0, 1]$ when $X = 1$). The capacity is achieved when $X \sim \mathrm{Bern}(1/2)$.

(e) For an inference problem, the optimal decoder is always the MAP decoder.

(f) Consider an inference problem in which we wish to decode $X \in \mathcal{X}$ from
$Y \in \mathcal{Y}$. Given $Y = y$, the optimal decoder can be:

$$
\hat{X} = \arg\max_{x \in \mathcal{X}} P(Y = y \mid X = x).
$$

(g) Suppose that $X_1 \sim \mathrm{Bern}(p)$, $X_2 \sim \mathrm{Bern}(\frac{1}{2})$ and $S = X_1 \oplus X_2$. Then, $S$
follows $\mathrm{Bern}(\frac{1}{2})$ no matter what $p$ is.

(h) In the channel coding setup, we assumed that a message $W$ is uniformly
distributed. The rationale behind this assumption is that we use an optimal
source code.

(i) In Section 2.1, we considered an optimization problem which aims to minimize
the probability of error given data rate $R$ and code length $n$. Denote
by $P_e^*(R, n)$ the minimum probability of error. Instead of deriving the exact
$P_e^*(R, n)$, Shannon developed a lower bound of $P_e^*(R, n)$ to show that for
any $R < C$, the probability of error can be made arbitrarily close to 0.

2.4 Achievability Proof for Discrete Memoryless
Channels


Recap In the previous section, we demonstrated the achievability proof for BSC. 
However, we will postpone the converse proof because the technique for the con- 
verse proof, which we will discuss later, can be directly applied to a broad range of 
problem settings. 


Outline This section will expand the proof of achievability to encompass the gen- 
eral scenario in which the channel is described by an arbitrary conditional distribu- 
tion p(y|x). Subsequently, in the following section, we will establish the converse 
proof for the general channel case. 


Discrete memoryless channel (DMC) The general channel that we will consider
is called the discrete memoryless channel, DMC for short. Let us start with the
definition of the channel. We say that a channel is a DMC if the input and output
are on discrete alphabet sets and the following condition is satisfied:

$$
p(y_i \mid x_i, x^{i-1}, y^{i-1}, W) = p(y_i \mid x_i).
$$

This is called the memoryless property. Notice that given the current channel input
$x_i$, the current output $y_i$ is independent of the past inputs/outputs $(x^{i-1}, y^{i-1})$ and
any other things including the message $W$. Here one key property that we need to
keep in mind is:

$$
p(y^n \mid x^n) = \prod_{i=1}^{n} p(y_i \mid x_i). \tag{2.8}
$$

This can be proved by using the memoryless property. Check in Prob 5.4. This
property plays a crucial role in proving the achievability as well as the converse.
This will be clearer later.
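The product form (2.8) is what makes sequence likelihoods easy to compute for any DMC. The sketch below evaluates it for the two channels studied so far. The nested-dictionary representation `channel[x][y]` of $p(y|x)$, the function name and the value of $p$ are assumptions made for this illustration.

```python
import math

def dmc_sequence_likelihood(x_seq, y_seq, channel):
    """p(y^n | x^n) = prod_i p(y_i | x_i), the product property (2.8).

    channel[x][y] holds the single-letter transition probability p(y | x).
    """
    if len(x_seq) != len(y_seq):
        raise ValueError("input and output sequences must have the same length")
    return math.prod(channel[x][y] for x, y in zip(x_seq, y_seq))

p = 0.1  # illustrative crossover / erasure probability
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
bec = {0: {0: 1 - p, "e": p, 1: 0.0}, 1: {0: 0.0, "e": p, 1: 1 - p}}

print(dmc_sequence_likelihood([0, 0, 0, 0], [0, 1, 0, 0], bsc))    # (1 - p)^3 p
print(dmc_sequence_likelihood([0, 0, 0, 0], [0, "e", 0, 0], bec))  # (1 - p)^3 p
print(dmc_sequence_likelihood([0, 1, 1, 0], [0, "e", 0, 0], bec))  # 0: incompatible
```

The BEC and BSC likelihoods computed in the earlier examples are special cases of this single formula, with only the transition table changing.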


Guess on the capacity formula Let us guess the capacity formula for the
DMC. Remember the capacity formulas of the BEC and BSC: $C_{\mathrm{BEC}} = 1 - p$;
$C_{\mathrm{BSC}} = 1 - H(p)$. These capacities are closely related to the key notion that we
introduced earlier: mutual information. Specifically what we can easily show is:
when $X \sim \mathrm{Bern}(\frac{1}{2})$,

$$
\begin{aligned}
C_{\mathrm{BEC}} &= 1 - p = I(X; Y); \\
C_{\mathrm{BSC}} &= 1 - H(p) = I(X; Y).
\end{aligned}
$$




Also one can verify that for an arbitrary distribution of $X$,

$$
\begin{aligned}
C_{\mathrm{BEC}} &= 1 - p \geq I(X; Y); \\
C_{\mathrm{BSC}} &= 1 - H(p) \geq I(X; Y).
\end{aligned}
$$

Check this in Prob 5.1. Hence, what one can guess on the capacity formula is:

$$
C_{\mathrm{DMC}} = \max_{p(x)} I(X; Y). \tag{2.9}
$$

It turns out it is indeed the case. For the rest of this section, we will prove the
achievability: if $R < \max_{p(x)} I(X; Y)$, the probability of error can be made arbitrarily
close to 0 as $n \to \infty$.


Encoder The encoder that we will employ is a random code, meaning that
$X_i(w)$'s are generated in an i.i.d. fashion according to some distribution $p(x)$. Recall
that in the BEC and BSC cases the input distribution was fixed as $\mathrm{Bern}(\frac{1}{2})$. One can
verify that $\mathrm{Bern}(\frac{1}{2})$ is the maximizer of the above optimization problem (2.9) for these
two channels. This motivates us to choose $p(x)$ as:

$$
p^*(x) = \arg\max_{p(x)} I(X; Y).
$$

Indeed, this choice enables us to achieve the capacity (2.9).
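The claim that $\mathrm{Bern}(\frac{1}{2})$ maximizes $I(X; Y)$ for the BEC and BSC can be checked numerically. The sketch below computes $I(X; Y) = \sum_{x, y} p(x) p(y|x) \log \frac{p(y|x)}{p(y)}$ in bits and runs a plain grid search over binary input distributions. The grid search is a stand-in chosen here for simplicity, not a general capacity algorithm, and all names and parameter values are illustrative assumptions.

```python
import math

def mutual_information(px, channel):
    """I(X;Y) in bits for input pmf px[x] and transition probabilities channel[x][y]."""
    outputs = {y for x in channel for y in channel[x]}
    py = {y: sum(px[x] * channel[x].get(y, 0.0) for x in px) for y in outputs}
    total = 0.0
    for x, p_x in px.items():
        for y, p_y_given_x in channel[x].items():
            joint = p_x * p_y_given_x
            if joint > 0.0:
                total += joint * math.log2(p_y_given_x / py[y])
    return total

def best_bernoulli_input(channel, grid=1000):
    """Grid search over X ~ Bern(q); returns (q, I(X;Y)) at the best grid point."""
    candidates = [k / grid for k in range(1, grid)]
    return max(((q, mutual_information({0: 1 - q, 1: q}, channel)) for q in candidates),
               key=lambda pair: pair[1])

p = 0.11  # illustrative channel parameter
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
bec = {0: {0: 1 - p, "e": p}, 1: {1: 1 - p, "e": p}}
print(best_bernoulli_input(bsc))  # maximizer q = 0.5, value close to 1 - H(p)
print(best_bernoulli_input(bec))  # maximizer q = 0.5, value close to 1 - p
```

For channels without this symmetry, such as the binary asymmetric channel of Prob 4.14(d), the same search generally returns a maximizer different from 0.5.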


Decoder Assume that the codebook is known at the decoder. We use a decoder in the
spirit of the suboptimal decoder employed in the BSC case. Remember that the suboptimal decoder
is based on a typical sequence and the fact that $Y^n \oplus X^n = Z^n$. One significant distinction
in the general DMC case is that $Y^n$ is an arbitrary random function of $X^n$.
So what we can do is to take a look at pairs of $(Y^n, X^n(w))$ and to check if we can
see a particular behavior of the pair associated with the true codeword, compared
to the other pairs w.r.t. the wrong codewords.

To illustrate this, consider the following situation. Suppose that $X^n(1)$ is transmitted,
i.e., message 1 is the correct one. Then, the joint distribution of the correct
pair $(x^n(1), y^n)$ would be:

$$
\begin{aligned}
p(x^n(1), y^n) &= p(x^n(1))\, p(y^n \mid x^n(1)) \\
&\overset{(a)}{=} \prod_{i=1}^{n} p(x_i(1))\, p(y_i \mid x_i(1)) \\
&= \prod_{i=1}^{n} p(x_i(1), y_i)
\end{aligned}
$$

where (a) follows from the key property (2.8) and the i.i.d. generation of the codeword
components. This implies that the pairs
$(x_i(1), y_i)$'s are i.i.d. over $i$'s. Then, using the WLLN on the i.i.d. sequence of pairs,
one can show that

$$
\frac{1}{n} \log \frac{1}{p(X^n(1), Y^n)} \longrightarrow H(X, Y) \quad \text{in prob.}
$$

Check in Prob 5.4. From this, one can say that

$$
p(x^n(1), y^n) \approx 2^{-nH(X, Y)}
$$

for sufficiently large $n$. Why?
On the other hand, the joint distribution of the wrong pair $(x^n(w), y^n)$ for $w \neq 1$
would be:

$$
p(x^n(w), y^n) = p(x^n(w))\, p(y^n).
$$

This is because $y^n$ is associated only with ($x^n(1)$, channel noise), which is independent
of $x^n(w)$. Remember that we use a random code in which codewords are
independent of each other. Also one can verify that $y^n$ is i.i.d. Check in Prob 5.4.
Again using the WLLN on $x^n(w)$ and $y^n$, one can get:

$$
p(x^n(w), y^n) \approx 2^{-n(H(X) + H(Y))}
$$

for sufficiently large $n$. This motivates the following decoder. Let

$$
A_\epsilon^{(n)} = \left\{ (x^n, y^n) :
\begin{array}{l}
2^{-n(H(X) + \epsilon)} \leq p(x^n) \leq 2^{-n(H(X) - \epsilon)}, \\
2^{-n(H(Y) + \epsilon)} \leq p(y^n) \leq 2^{-n(H(Y) - \epsilon)}, \\
2^{-n(H(X, Y) + \epsilon)} \leq p(x^n, y^n) \leq 2^{-n(H(X, Y) - \epsilon)}
\end{array}
\right\}.
$$

The decoding rule is as follows:

1. Eliminate all the messages $w$'s such that $(x^n(w), y^n) \notin A_\epsilon^{(n)}$.
2. If there is only one message that survives, then declare the survival as the
correct message.
3. Otherwise, declare an error.


Similar to the previous case, the error event is of two types: (i) multiple survivals
(or one wrong survival); and (ii) no survival.
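The joint typicality rule can be sketched for any DMC given as a transition table. The code below is a toy illustration under assumptions made here: the three membership conditions are checked in the equivalent form $\left| -\frac{1}{n} \log p(\cdot) - H(\cdot) \right| \leq \epsilon$, logarithms are base 2, and the names and the nested-dictionary channel representation are hypothetical.

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a dict of probabilities."""
    return -sum(q * math.log2(q) for q in pmf.values() if q > 0.0)

def is_jointly_typical(x_seq, y_seq, px, channel, eps):
    """Check the three conditions defining the jointly typical set."""
    n = len(x_seq)
    pxy = {(x, y): px[x] * channel[x][y] for x in px for y in channel[x]}
    py = {}
    for (_, y), q in pxy.items():
        py[y] = py.get(y, 0.0) + q

    def sample_entropy(seq, pmf):
        # -(1/n) log2 of the product probability; infinite if a symbol has probability 0
        if any(pmf.get(s, 0.0) == 0.0 for s in seq):
            return math.inf
        return -sum(math.log2(pmf[s]) for s in seq) / n

    checks = [
        (sample_entropy(x_seq, px), entropy(px)),
        (sample_entropy(y_seq, py), entropy(py)),
        (sample_entropy(list(zip(x_seq, y_seq)), pxy), entropy(pxy)),
    ]
    return all(abs(s - h) <= eps for s, h in checks)

def joint_typicality_decode(codebook, y_seq, px, channel, eps):
    """Return the unique jointly typical message (1-indexed), or None for an error."""
    survivors = [w for w, x in enumerate(codebook, start=1)
                 if is_jointly_typical(x, y_seq, px, channel, eps)]
    return survivors[0] if len(survivors) == 1 else None

p = 0.25  # illustrative BSC crossover probability
px = {0: 0.5, 1: 0.5}
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
codebook = [[0, 0, 0, 0], [1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1]]
print(joint_typicality_decode(codebook, [0, 1, 0, 0], px, bsc, eps=0.1))
```

On the BSC with a uniform input, the first two conditions hold for every sequence, and the third reduces to the typicality of $Y^n \oplus X^n(w)$ w.r.t. $\mathrm{Bern}(p)$, so this decoder coincides with the suboptimal BSC decoder of the previous section.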


Probability of error We analyze the probability of error to complete the achievability
proof. Using the same argument that we made in the BEC case, we get:

$$
P_e := \mathbb{E}_{\mathcal{C}}[P_e(\mathcal{C})] = P(\hat{W} \neq 1 \mid W = 1).
$$

As mentioned earlier, the error event is of the two types: (i) multiple survivals;
and (ii) no survival. The multiple-survival event implies the existence of a wrong
pair being a jointly typical pair, meaning that there exists $w \neq 1$ such that
$(X^n(w), Y^n) \in A_\epsilon^{(n)}$. The no-survival event implies that even the correct pair is
not jointly typical, meaning that $(X^n(1), Y^n) \notin A_\epsilon^{(n)}$. Hence, we get:

$$
\begin{aligned}
P_e &= P(\hat{W} \neq 1 \mid W = 1) \\
&\leq P\left( \bigcup_{w \neq 1} \{(X^n(w), Y^n) \in A_\epsilon^{(n)}\} \cup \{(X^n(1), Y^n) \notin A_\epsilon^{(n)}\} \,\Big|\, W = 1 \right) \\
&\overset{(a)}{\leq} \sum_{w=2}^{2^{nR}} P((X^n(w), Y^n) \in A_\epsilon^{(n)} \mid W = 1) + P((X^n(1), Y^n) \notin A_\epsilon^{(n)} \mid W = 1) \\
&\overset{(b)}{\leq} 2^{nR}\, P((X^n(2), Y^n) \in A_\epsilon^{(n)} \mid W = 1) + P((X^n(1), Y^n) \notin A_\epsilon^{(n)} \mid W = 1) \\
&\overset{(c)}{\approx} 2^{nR}\, P((X^n(2), Y^n) \in A_\epsilon^{(n)} \mid W = 1)
\end{aligned}
$$


where (a) follows from the union bound; (b) is by symmetry of the codewords w.r.t.
message indices, together with $2^{nR} - 1 \leq 2^{nR}$; and (c) follows from the fact that $(X^n(1), Y^n)$ is jointly typical
for sufficiently large $n$ w.h.p. due to the WLLN. Observe that $X^n(2)$ and $Y^n$ are
independent. So the total number of pair patterns w.r.t. $(X^n(2), Y^n)$ would be:

$$
\text{total number of } (X^n(2), Y^n) \text{ pairs} \approx 2^{nH(X)} \cdot 2^{nH(Y)} = 2^{n(H(X) + H(Y))}.
$$

On the other hand, the cardinality of the jointly typical pair set $A_\epsilon^{(n)}$ is:

$$
|A_\epsilon^{(n)}| \approx 2^{nH(X, Y)}.
$$

Why? Hence, we get:

$$
\begin{aligned}
P((X^n(2), Y^n) \in A_\epsilon^{(n)} \mid W = 1) &\approx \frac{|A_\epsilon^{(n)}|}{\text{total number of } (X^n(2), Y^n) \text{ pairs}} \\
&\approx 2^{n(H(X, Y) - H(X) - H(Y))} \\
&= 2^{-nI(X; Y)}.
\end{aligned}
$$

Using this, we get:

$$
P_e \lesssim 2^{n(R - I(X; Y))}.
$$

Hence, if $R < I(X; Y)$, then $P_e \to 0$ as $n \to \infty$. Since we choose $p(x)$ such that
$I(X; Y)$ is maximized, $\max_{p(x)} I(X; Y)$ is achievable. This completes the proof.


Look ahead So far we have proved the achievability for discrete memoryless
channels. In the next section, we will prove the converse to complete the proof
of the channel coding theorem.

2.5 Converse Proof for Discrete Memoryless Channels


Recap In the previous section, we have proven the achievability for discrete memoryless
channels which are described by conditional distribution $p(y|x)$:

$$
R < \max_{p(x)} I(X; Y) \implies P_e \to 0.
$$

Outline In this section, we will prove the converse to complete the channel coding
theorem:

$$
P_e \to 0 \implies R \leq \max_{p(x)} I(X; Y).
$$


The proof consists of three parts. Firstly, we will examine Fano’s inequality, a fun- 
damental inequality that plays a crucial role in the converse proof. Secondly, we 
will explore another critical inequality known as the data processing inequality 
(DPI). Lastly, utilizing both these inequalities, we will present the final proof of 
the converse. 


Fano’s inequality Fano’s inequality is a significant inequality in the context of 
inference problems, which involve inferring an input from an output that is prob- 
abilistically related to the input. In such problems, the only possible action with 
respect to the input is to make an inference or guess. The communication problem 
we have been studying can also be viewed as an inference problem, where the goal is 
to infer the message W from the received signal Y”, which is stochastically related 
to W. In this context, Fano’s inequality relates the following two quantities: 


P.:= P(W 4 W) & H(W|W). 


One can expect that the smaller P,, the smaller H(W|W). Fano’s inequality 
presents how one affects the other in a precise manner. In the converse proof, 
we need to establish a lower bound on P, to show that P, does not converge to zero 
when R > C. Fano’s inequality plays a crucial role in providing this lower bound 
by giving an upper bound on H(W|W) expressed in terms of P,. The formula for 
this upper bound is well-known, and we will focus on its expression: 


Fano’ inequality: H(W\W) < 1 + P,-nR. (2.10) 


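This indeed captures the intimate relationship between $P_e$ and $H(W|\hat{W})$. The
smaller $P_e$, the more contracted $H(W|\hat{W})$. The proof of this is simple. Let

$$E = \mathbf{1}\{\hat{W} \neq W\}.$$

By the definition of the error probability and the fact that
$\mathbb{E}[\mathbf{1}\{\hat{W} \neq W\}] = P(\hat{W} \neq W)$, we see that $E \sim \mathrm{Bern}(P_e)$.
Starting with the fact that $E$ is a function of $(W, \hat{W})$, we have:

$$
\begin{aligned}
H(W|\hat{W}) &= H(W, E|\hat{W}) \\
&\overset{(a)}{=} H(E|\hat{W}) + H(W|\hat{W}, E) \\
&\overset{(b)}{\le} 1 + H(W|\hat{W}, E) \\
&\overset{(c)}{=} 1 + P(E = 1) \cdot H(W|\hat{W}, E = 1) \\
&= 1 + P_e \cdot H(W|\hat{W}, E = 1) \\
&\overset{(d)}{\le} 1 + P_e \cdot nR
\end{aligned}
$$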


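where (a) is due to a chain rule; (b) follows from the cardinality bound on $H(E|\hat{W})$
(since $E$ is binary, $H(E|\hat{W}) \le 1$); (c) comes from the definition of conditional
entropy together with the fact that $H(W|\hat{W}, E = 0) = 0$, because $E = 0$ means
$W = \hat{W}$; and (d) follows from the cardinality bound on $H(W|\hat{W}, E = 1)$, as $W$
takes at most $2^{nR}$ values.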
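As a sanity check on (2.10), the following sketch tests the bound numerically on randomly drawn joint distributions of $(W, \hat{W})$. The alphabet size, the random seed, and all function names are assumptions made for illustration only; with $|\mathcal{W}| = 8$ messages, $nR = \log_2 8 = 3$ bits.

```python
import math
import random

def conditional_entropy(joint):
    """H(W | W_hat) in bits, where joint[w][w_hat] = P(W = w, W_hat = w_hat)."""
    num_hat = len(joint[0])
    h = 0.0
    for j in range(num_hat):
        p_hat = sum(row[j] for row in joint)  # marginal P(W_hat = j)
        for row in joint:
            if row[j] > 0:
                h -= row[j] * math.log2(row[j] / p_hat)
    return h

def error_probability(joint):
    """P_e = P(W_hat != W): total probability mass off the diagonal."""
    return sum(p for i, row in enumerate(joint)
               for j, p in enumerate(row) if i != j)

def random_joint(m, rng):
    """A random joint pmf on an m-by-m alphabet (illustrative only)."""
    weights = [[rng.random() for _ in range(m)] for _ in range(m)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

rng = random.Random(0)
m = 8  # hypothetical message-set size, so nR = log2(m) = 3 bits
for _ in range(1000):
    joint = random_joint(m, rng)
    lhs = conditional_entropy(joint)
    rhs = 1 + error_probability(joint) * math.log2(m)
    assert lhs <= rhs + 1e-12  # Fano's inequality (2.10)
```

The check never fails because the proof above does not rely on any particular structure of the joint distribution.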


Data processing inequality (DPI) The next inequality we will examine is
called the data processing inequality (DPI). Essentially, DPI states that any
processing of data cannot enhance the quality of inference beyond what was originally
available. In the case of our problem, this means that the quality of inference
of $W$ based on $X^n$ (which we can view as the original data) cannot be
inferior to that based on processed data, such as $Y^n$. DPI is a mathematical
statement formulated in the context of a Markov process. Therefore, let us first study
what a Markov process is. We say that a random process, say $\{X_i\}$, is a Markov
process if

$$p(x_{i+1} | x_i, x_{i-1}, \ldots, x_1) = p(x_{i+1} | x_i).$$


The meaning of this condition is that, given the current state $x_i$, the future state
$x_{i+1}$ and the past states $x^{i-1}$ are independent of each other. A Markov process is
often represented graphically using a well-known diagram. For example, if $(X_1, X_2, X_3)$
form a Markov process, it can be represented as:

$$X_1 - X_2 - X_3.$$

The reason for representing a Markov process in this way is as follows. If we are
given $X_2$, we can remove it from the graph since it is already known. This removal
results in $X_3$ and $X_1$ being disconnected, indicating that they are statistically
independent. Since the resulting graph resembles a chain, it is referred to as a Markov
chain.

Our problem context exhibits a Markov chain. One can show that $(W, X^n, Y^n)$
forms a Markov chain:

$$W - X^n - Y^n.$$

The proof is straightforward. Given $X^n$, $Y^n$ is a sole function of the noise induced
by the channel. Since the noise has nothing to do with the message $W$, $Y^n$ and $W$
are conditionally independent given $X^n$.

DPI is defined w.r.t. the Markov chain. It captures the relationship between the
following quantities: $I(W; X^n)$, $I(W; Y^n)$, $I(X^n; Y^n)$. In fact, $I(W; X^n)$ represents
the common information shared between $W$ and $X^n$, which can be seen as the
quality of inference on $W$ based on $X^n$. Similarly, $I(W; Y^n)$ represents the quality
of inference on $W$ based on $Y^n$. The verbal statement of DPI states that the quality
of inference on $W$ cannot be improved by processing the original data $X^n$ to obtain
$Y^n$. In terms of mutual information, this can be expressed as follows:

$$I(W; Y^n) \le I(W; X^n).$$


The proof of this is simple. Starting with a chain rule and applying non-negativity
of mutual information, we have:

$$
\begin{aligned}
I(W; Y^n) &\le I(W; Y^n, X^n) \\
&= I(W; X^n) + I(W; Y^n | X^n) \qquad (2.11) \\
&\overset{(a)}{=} I(W; X^n)
\end{aligned}
$$

where (a) follows from the fact that $W - X^n - Y^n$, which gives $I(W; Y^n | X^n) = 0$.
There is another DPI w.r.t. $I(W; Y^n)$ and another mutual information $I(X^n; Y^n)$.
Note that $X^n$ is closer to $Y^n$ relative to the distance between $W$ and $Y^n$. Hence,
one can guess:

$$I(W; Y^n) \le I(X^n; Y^n).$$

It turns out this is indeed the case. The proof is also simple.

$$
\begin{aligned}
I(W; Y^n) &\le I(W, X^n; Y^n) \\
&= I(X^n; Y^n) + I(W; Y^n | X^n) \qquad (2.12) \\
&= I(X^n; Y^n).
\end{aligned}
$$

From (2.11) and (2.12), one can summarize that mutual information between two
ending terms in the Markov chain does not exceed mutual information between
any two terms that lie in-between the two ends.
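The two inequalities can be checked on a small example. The sketch below uses a three-term chain $X - Y - Z$ built from two binary symmetric channels in series, playing the role of $W - X^n - Y^n$; the input distribution, the crossover probabilities, and the helper names are hypothetical choices for illustration.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits for joint[a][b] = P(A = a, B = b)."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )

def bsc(p):
    """Transition matrix of a binary symmetric channel with crossover p."""
    return [[1 - p, p], [p, 1 - p]]

def joint_through(prior, channel):
    """Joint pmf of (input, output) for an input pmf and a channel matrix."""
    return [[prior[a] * channel[a][b] for b in range(len(channel[a]))]
            for a in range(len(prior))]

def compose(first, second):
    """Transition matrix of two channels connected in series."""
    return [[sum(first[a][y] * second[y][b] for y in range(len(second)))
             for b in range(len(second[0]))]
            for a in range(len(first))]

p_x = [0.3, 0.7]               # hypothetical input distribution
ch1, ch2 = bsc(0.1), bsc(0.2)  # hypothetical crossover probabilities

i_xy = mutual_information(joint_through(p_x, ch1))
i_xz = mutual_information(joint_through(p_x, compose(ch1, ch2)))
p_y = [sum(p_x[a] * ch1[a][y] for a in range(2)) for y in range(2)]
i_yz = mutual_information(joint_through(p_y, ch2))

print(f"I(X;Y) = {i_xy:.4f}, I(Y;Z) = {i_yz:.4f}, I(X;Z) = {i_xz:.4f}")
# The mutual information between the two ends never exceeds that of an inner pair.
assert i_xz <= i_xy + 1e-12 and i_xz <= i_yz + 1e-12
```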

Looking at our problem setting, there is another term: $\hat{W}$. This together with
the above Markov chain $(W - X^n - Y^n)$ forms a longer Markov chain:

$$W - X^n - Y^n - \hat{W}.$$

Given $Y^n$, $\hat{W}$ is completely determined regardless of $(W, X^n)$ because it is a
function of $Y^n$. Now applying DPI, one can verify that

$$
\begin{aligned}
I(W; \hat{W}) &\le I(W; Y^n); \qquad (2.13) \\
I(W; \hat{W}) &\le I(W; X^n); \qquad (2.14) \\
I(W; \hat{W}) &\le I(X^n; \hat{W}); \qquad (2.15) \\
I(W; \hat{W}) &\le I(X^n; Y^n); \qquad (2.16) \\
I(W; \hat{W}) &\le I(Y^n; \hat{W}). \qquad (2.17)
\end{aligned}
$$


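Converse proof We are ready to prove the converse with the two inequalities.
Starting with the fact that $nR = H(W)$, we have:

$$
\begin{aligned}
nR &= H(W) \\
&= I(W; \hat{W}) + H(W|\hat{W}) \\
&\overset{(a)}{\le} I(W; \hat{W}) + 1 + P_e \cdot nR \\
&\overset{(b)}{=} I(W; \hat{W}) + n\epsilon_n \\
&\overset{(c)}{\le} I(X^n; Y^n) + n\epsilon_n \\
&= H(Y^n) - H(Y^n | X^n) + n\epsilon_n \\
&\overset{(d)}{=} H(Y^n) - \sum_{i=1}^{n} H(Y_i | X_i) + n\epsilon_n \\
&\overset{(e)}{\le} \sum_{i=1}^{n} \left[ H(Y_i) - H(Y_i | X_i) \right] + n\epsilon_n \\
&= \sum_{i=1}^{n} I(X_i; Y_i) + n\epsilon_n \\
&\overset{(f)}{\le} nC + n\epsilon_n
\end{aligned}
$$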


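where (a) follows from Fano’s inequality (2.10); (b) comes from the definition of
$\epsilon_n$ that we set as:

$$\epsilon_n := \frac{1}{n}(1 + n P_e R);$$

(c) is due to DPI (2.16); (d) follows from the memoryless channel property
($p(y^n | x^n) = \prod_{i=1}^{n} p(y_i | x_i)$ and hence $H(Y^n | X^n) = \sum_{i=1}^{n} H(Y_i | X_i)$);
(e) follows from a chain rule and the fact that conditioning reduces entropy, i.e.,
$H(Y^n) = \sum_{i=1}^{n} H(Y_i | Y^{i-1}) \le \sum_{i=1}^{n} H(Y_i)$; and (f) is due to the
definition $C := \max_{p(x)} I(X; Y)$.
Dividing by $n$ on both sides, we get:

$$R \le C + \epsilon_n.$$

If $P_e \to 0$, $\epsilon_n := \frac{1}{n}(1 + P_e nR)$ tends to 0 as $n \to \infty$. Hence, we get:

$$R \le C.$$

This completes the proof.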
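Rearranging the chain $nR \le nC + 1 + P_e \cdot nR$ obtained above gives an explicit lower bound $P_e \ge 1 - C/R - 1/(nR)$, which stays bounded away from zero whenever $R > C$. The sketch below evaluates this bound for a BSC; the crossover probability 0.11 and the two rates are arbitrary values chosen for illustration, and the function names are not from the text.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def error_lower_bound(rate, capacity, n):
    """Lower bound on P_e implied by nR <= nC + 1 + P_e * nR (clipped at 0)."""
    return max(0.0, 1.0 - capacity / rate - 1.0 / (n * rate))

capacity = 1.0 - binary_entropy(0.11)  # BSC with a hypothetical crossover 0.11
for n in (10, 100, 1000, 10000):
    below = error_lower_bound(0.40, capacity, n)  # R < C: the bound is vacuous
    above = error_lower_bound(0.75, capacity, n)  # R > C: the bound stays positive
    print(f"n={n:5d}  R=0.40: P_e >= {below:.4f}   R=0.75: P_e >= {above:.4f}")
```

For $R > C$ the bound approaches $1 - C/R$ as $n$ grows, which is the quantitative content of the statement that $P_e$ cannot converge to zero above capacity.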


Look ahead The source and channel coding theorems have been proven within 
Shannon's two-stage communication architecture, which does not permit inter- 
action between the source and channel codes. However, a question arises as to 
whether this separation approach is optimal and can achieve the same performance 
as the general architecture that allows for cooperation between the two codes. 
Surprisingly, the answer is yes, and we will prove this in the next section.

2.6 Source-Channel Separation Theorem and Feedback


Recap So far, we have studied Shannon’s two fundamental theorems: the source 
and channel coding theorems. These theorems are established under a specific 
two-stage architecture, as shown in Fig. 2.2. Some readers may be curious about 
what would happen if the architecture was arbitrary and allowed for any interac- 
tion between the source and channel codes. Can we achieve better performance 
with potential cooperation between the two codes? This was a question that Shan- 
non himself raised in his landmark paper of 1948 (Shannon, 2001). Surprisingly, 
he answered this question negatively by establishing the source-channel separation 
theorem. This result is both surprising and important. It is surprising because the 
simple separation approach is optimal, and it is important because it forms the
foundation of the digital communication architecture, where the digital interface
operates independently of the source code block.


Outline In this section, we will present the proof for the source-channel separa- 
tion theorem. The proof consists of three parts. Firstly, we will identify a condition 
that the separation approach relies upon for reliable transmission of an information 
source, which will serve as a sufficient condition for reliable transmission. Secondly, 
using the two critical inequalities, Fano’s inequality and data processing inequal- 
ity, introduced in the previous section, we will demonstrate that the condition is 
also necessary, thus establishing the optimality of the separation approach. Finally, 
we will explore a distinct topic that has a close technical connection with the two 
inequalities, namely the role of channel output feedback, and provide a detailed 
analysis of this topic. 


What to prove for the optimality of the separation approach? As the 
separation approach is a specific communication scheme, a condition that guaran- 
tees reliable transmission of an information source using this approach can serve 
as a sufficient condition. Therefore, proving that this condition is also necessary 
would establish the optimality of the separation approach. In the sequel, we will
first come up with a sufficient condition due to the separation approach, and then
prove its necessity accordingly.

Figure 2.2. Shannon’s two-staged architecture for communication systems.

Figure 2.3. Joint source-channel encoder and decoder.


A sufficient condition due to the separation approach Suppose we want
to transmit $k$ source symbols using $n$ channel uses, as shown in Fig. 2.3. An important
question is: under what conditions can the separation approach reliably transmit
$k$ symbols? This condition can be identified using the source and channel coding
theorems. The source coding theorem states that the entropy rate $H$ (in bits
per symbol) is the minimum number of bits required to represent the information
source per symbol. The channel coding theorem states that the capacity $C$ (in bits
per channel use) is the maximum number of bits that can be reliably transmitted
per channel use. We can represent the entropy of the $k$ source symbols as $kH$, and
the total number of bits that can be transmitted using $n$ channel uses as $nC$. If we apply
the source and channel coders separately, a sufficient condition would be:

$$kH \le nC. \qquad (2.18)$$
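To make condition (2.18) concrete, the sketch below computes the smallest number of channel uses per source symbol, $n/k \ge H/C$, for an i.i.d. Bernoulli source sent over a BSC. The source parameter 0.1, the crossover probability 0.05, the block length, and the function names are assumptions chosen only for illustration.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def min_channel_uses_per_symbol(source_entropy, capacity):
    """Smallest ratio n/k compatible with kH <= nC."""
    return source_entropy / capacity

source_entropy = binary_entropy(0.1)   # hypothetical i.i.d. Bern(0.1) source
capacity = 1.0 - binary_entropy(0.05)  # hypothetical BSC with crossover 0.05
ratio = min_channel_uses_per_symbol(source_entropy, capacity)

k = 1000                               # number of source symbols
n_min = math.ceil(k * ratio)           # smallest n satisfying (2.18)
print(f"H = {source_entropy:.4f}, C = {capacity:.4f}, n/k >= {ratio:.4f}")
print(f"k = {k} source symbols need at least n = {n_min} channel uses")
```

In this example the source is compressible enough that fewer channel uses than source symbols suffice, even though the channel is noisy.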


Necessity of the sufficient condition (2.18) We will prove that the
condition (2.18) is also necessary by using Fano’s inequality and DPI. To determine what
needs to be proven precisely, let us first express several quantities that arise in the
problem setup. We denote by $V^k$ the $k$ source symbols. $X^n$ is the sequence fed into
the channel, and $Y^n$ is the channel output. $\hat{V}^k$ is the decoded source. In contrast to
the separation approach setup, $\hat{V}^k$ is the output of a joint source-channel decoder
that allows for any interaction across source and channel codes. This is because we
intend to prove the necessity of the condition (2.18) under any arbitrary scheme.

The probability of error is defined as:

$$P_e = P(\hat{V}^k \neq V^k).$$

What we wish to show is that for reliable communication, i.e., $P_e \to 0$, the
condition (2.18) must hold.
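We will prove its necessity for a stationary random process where

$$H = \lim_{m \to \infty} \frac{H(V_1, \ldots, V_m)}{m}.$$

A stationary process has an interesting property on $\frac{H(V_1, \ldots, V_m)}{m}$. To see this,
let us first massage this term as:

$$\frac{H(V_1, \ldots, V_m)}{m} = \frac{1}{m} \sum_{i=1}^{m} H(V_i | V^{i-1}).$$

This is because of a chain rule. Let $a_i := H(V_i | V^{i-1})$. Using the stationarity and
the fact that conditioning reduces entropy,

$$a_i = H(V_i | V_1, \ldots, V_{i-1}) \le H(V_i | V_2, \ldots, V_{i-1}) = H(V_{i-1} | V_1, \ldots, V_{i-2}) = a_{i-1}.$$

Keeping this in our mind, compare the following two terms:

$$\frac{1}{m} \sum_{i=1}^{m} a_i \quad \text{vs.} \quad \frac{1}{m+1} \sum_{i=1}^{m+1} a_i.$$

Since $a_i$ is non-increasing in $i$, $\frac{1}{m} \sum_{i=1}^{m} a_i$ is also non-increasing in $m$:

$$\frac{1}{m} \sum_{i=1}^{m} a_i \ge \frac{1}{m+1} \sum_{i=1}^{m+1} a_i.$$

Hence, for any positive integer $k$,

$$\lim_{m \to \infty} \frac{H(V_1, \ldots, V_m)}{m} \le \frac{H(V_1, \ldots, V_k)}{k}.$$

Using this property, we have:

$$
\begin{aligned}
kH &= k \lim_{m \to \infty} \frac{H(V_1, \ldots, V_m)}{m} \\
&\le k \cdot \frac{H(V_1, \ldots, V_k)}{k} \\
&= H(V^k).
\end{aligned}
$$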
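The monotonicity of $H(V^m)/m$ can be observed numerically. The sketch below takes a two-state stationary Markov source as an example of a stationary process (an assumption for illustration; the argument above applies to any stationary process), computes $H(V^m)$ by brute-force enumeration, and checks that the per-symbol entropy is non-increasing in $m$ and stays above the entropy rate. The transition probabilities and function names are hypothetical.

```python
import itertools
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def block_entropy(m, a, b):
    """H(V_1, ..., V_m) in bits for a stationary two-state Markov chain
    with P(0 -> 1) = a and P(1 -> 0) = b, by enumerating all 2^m sequences."""
    trans = [[1 - a, a], [b, 1 - b]]
    pi = [b / (a + b), a / (a + b)]  # stationary distribution
    h = 0.0
    for seq in itertools.product((0, 1), repeat=m):
        p = pi[seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            p *= trans[prev][cur]
        if p > 0:
            h -= p * math.log2(p)
    return h

a, b = 0.1, 0.3  # hypothetical transition probabilities
per_symbol = [block_entropy(m, a, b) / m for m in range(1, 11)]

# Entropy rate of the chain: H(V_2 | V_1) under the stationary distribution.
entropy_rate = (b * binary_entropy(a) + a * binary_entropy(b)) / (a + b)

for m, value in enumerate(per_symbol, start=1):
    print(f"m = {m:2d}: H(V^m)/m = {value:.4f}")
print(f"entropy rate H = {entropy_rate:.4f}")
assert all(x >= y - 1e-12 for x, y in zip(per_symbol, per_symbol[1:]))
```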
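Starting with this and applying the definition of
$I(V^k; \hat{V}^k) := H(V^k) - H(V^k | \hat{V}^k)$, we then get:

$$
\begin{aligned}
kH &\le H(V^k) \\
&= I(V^k; \hat{V}^k) + H(V^k | \hat{V}^k).
\end{aligned}
$$

We are now ready to employ Fano’s inequality and data processing inequality. To
be specific, Fano’s inequality reads:

$$H(V^k | \hat{V}^k) \le 1 + k P_e \log |\mathcal{V}| =: k\epsilon_k.$$

DPI that we want to use in this context is:

$$I(V^k; \hat{V}^k) \le I(X^n; Y^n).$$

Applying these into the above, we obtain:

$$
\begin{aligned}
kH &\le H(V^k) \\
&\le I(V^k; \hat{V}^k) + k\epsilon_k \\
&\le I(X^n; Y^n) + k\epsilon_k \\
&= H(Y^n) - H(Y^n | X^n) + k\epsilon_k.
\end{aligned}
$$

Manipulating further, we get:

$$
\begin{aligned}
kH &\le H(Y^n) - H(Y^n | X^n) + k\epsilon_k \\
&\overset{(a)}{=} \sum_{i=1}^{n} H(Y_i | Y^{i-1}) - H(Y^n | X^n) + k\epsilon_k \\
&\overset{(b)}{\le} \sum_{i=1}^{n} \left[ H(Y_i) - H(Y_i | X_i) \right] + k\epsilon_k \\
&= \sum_{i=1}^{n} I(X_i; Y_i) + k\epsilon_k \\
&\overset{(c)}{\le} nC + k\epsilon_k
\end{aligned}
$$

where (a) comes from a chain rule; (b) follows from the memoryless property of
DMC and the fact that conditioning reduces entropy; and (c) is due to the
definition of $C := \max_{p(x)} I(X; Y)$. As $k \to \infty$ with $P_e \to 0$,
$\epsilon_k = \frac{1}{k} + P_e \log |\mathcal{V}| \to 0$. Hence, we get:

$$H \le \frac{n}{k} C.$$

This completes the proof.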


Discrete memoryless channel with feedback Next, we will discuss another
topic that is related to Fano’s inequality and DPI in a technical sense. The topic
we will cover is the role of channel output feedback, which was studied by
Shannon previously. Feedback has proven to be valuable in numerous areas, particularly
in control. Feedback is known to have a significant role in stabilizing systems. In
communication, however, feedback has not been very useful. This is mainly due
to Shannon’s original result in the 1950s (Shannon, 1956). What Shannon showed
is that feedback cannot increase the capacity of a memoryless channel. Hence, feedback
has been used to verify the success of transmission. For the rest of this section, we will
explore this counter-intuitive result in depth.

Figure 2.4. A discrete memoryless channel with feedback.

We first introduce a channel model that Shannon considered. It is based on the
memoryless channel which respects:

$$p(y_i | x^i, y^{i-1}, W) = p(y_i | x_i).$$

We consider channel output feedback where the past channel output is fed back to
the encoder. See Fig. 2.4. Hence, a transmitted signal $X_i$ at time $i$ is a function of
the message $W$ and the past channel output $Y^{i-1} := (Y_1, \ldots, Y_{i-1})$. Under this
model, what Shannon showed is that the feedback capacity $C_{\mathrm{FB}}$ is the same as the
non-feedback capacity:

$$C_{\mathrm{FB}} = C_{\mathrm{NO}}, \qquad (2.19)$$

meaning that feedback cannot increase the capacity.


Proof of (2.19) The initial procedure for the converse proof is the same as that
of the non-feedback case. We start with:

$$
\begin{aligned}
nR &= H(W) \\
&= I(W; \hat{W}) + H(W | \hat{W}) \\
&\overset{(a)}{\le} I(W; \hat{W}) + n\epsilon_n
\end{aligned}
$$


where (a) is due to Fano’s inequality and $\epsilon_n := \frac{1}{n}(1 + nRP_e)$. The next step is to
employ DPI to make progress w.r.t. $I(W; \hat{W})$. Remember in the non-feedback
case that we used the following DPI:

$$I(W; \hat{W}) \le I(X^n; Y^n).$$

We then expressed $I(X^n; Y^n)$ as $H(Y^n) - H(Y^n | X^n)$. The conditional entropy
$H(Y^n | X^n)$ was expressed as:

$$H(Y^n | X^n) = \sum_{i=1}^{n} H(Y_i | X_i), \qquad (2.20)$$

and it played a crucial role to prove the converse. The above (2.20) was because of
the key memoryless property:

$$p(y^n | x^n) = \prod_{i=1}^{n} p(y_i | x_i).$$

It holds in the non-feedback case. See Prob 5.4 for the proof. In the feedback case,
however, it does not hold any more. Hence, we should take a different approach to
prove the converse.

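Even in the feedback case, what we know for sure is about a Markov chain
relationship, which reads:

$$(W, X^n) - Y^n - \hat{W}.$$

This then yields the following DPI:

$$I(W; \hat{W}) \le I(W; Y^n).$$

Applying this different DPI to $I(W; \hat{W})$ in the above, we get:

$$
\begin{aligned}
nR &\le I(W; Y^n) + n\epsilon_n \\
&\overset{(a)}{=} \sum_{i=1}^{n} \left[ H(Y_i | Y^{i-1}) - H(Y_i | Y^{i-1}, W) \right] + n\epsilon_n \\
&\overset{(b)}{=} \sum_{i=1}^{n} \left[ H(Y_i | Y^{i-1}) - H(Y_i | Y^{i-1}, W, X_i) \right] + n\epsilon_n \\
&\overset{(c)}{=} \sum_{i=1}^{n} \left[ H(Y_i | Y^{i-1}) - H(Y_i | X_i) \right] + n\epsilon_n \\
&\overset{(d)}{\le} \sum_{i=1}^{n} \left[ H(Y_i) - H(Y_i | X_i) \right] + n\epsilon_n \\
&= \sum_{i=1}^{n} I(X_i; Y_i) + n\epsilon_n \\
&\overset{(e)}{\le} nC_{\mathrm{NO}} + n\epsilon_n
\end{aligned}
$$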
where (a) is due to the definition of mutual information and a chain rule; (b)
follows from the fact that $X_i$ is a function of $(W, Y^{i-1})$ and adding a function
in conditioning does not alter entropy; (c) follows from the Markov property
$(W, Y^{i-1}) - X_i - Y_i$; (d) comes from the fact that conditioning reduces entropy;
and (e) is because of the definition $C_{\mathrm{NO}} := \max_{p(x)} I(X; Y)$.
The above gives

$$R \le C_{\mathrm{NO}} + \epsilon_n.$$

Under reliable communication, $\epsilon_n \to 0$. Hence, we prove:

$$R \le C_{\mathrm{NO}}.$$
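Although feedback does not raise the capacity, it can make capacity-achieving schemes very simple. The following simulation sketch illustrates this with a binary erasure channel, an example not taken from the text: with feedback, the encoder simply repeats each bit until it is received unerased, and the resulting rate approaches $1 - p$, the non-feedback capacity of the BEC, without exceeding it. The function name, the erasure probability, and the seed are assumptions.

```python
import random

def simulate_bec_with_feedback(num_bits, erasure_prob, seed=0):
    """Repeat each bit until it gets through a BEC; return bits per channel use.

    Feedback is what tells the encoder whether the last use was erased."""
    rng = random.Random(seed)
    channel_uses = 0
    for _ in range(num_bits):
        while True:
            channel_uses += 1
            if rng.random() >= erasure_prob:  # not erased: bit delivered
                break
    return num_bits / channel_uses

erasure_prob = 0.3  # hypothetical erasure probability
rate = simulate_bec_with_feedback(100_000, erasure_prob)
print(f"empirical rate = {rate:.4f}, capacity 1 - p = {1 - erasure_prob:.4f}")
```

Each bit needs $1/(1-p)$ channel uses on average, so the empirical rate concentrates around $1 - p$, consistent with $C_{\mathrm{FB}} = C_{\mathrm{NO}}$.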


Look ahead Up to this point, we have proven the source coding theorem, the
channel coding theorem, and the source-channel separation theorem for a broad
range of channels, specifically the discrete memoryless channel. We also explored
the technical connection between the role of feedback and the converse proof.
However, there is a topic that has not been thoroughly addressed in the channel coding
theorem, which pertains to the achievable scheme. During the achievability proof,
we established the existence of an optimal code that can achieve $P_e$ close to 0 as $n$
approaches $\infty$, but we did not discuss how to construct the optimal code. Moving
forward, we will delve into deterministic codes that are explicit and approach or
even reach the capacity.

Problem Set 5


Prob 5.1 (Channel capacity) Let X and Y be the input and the output of a 
discrete memoryless channel, respectively. 


(a) Suppose the channel is a BEC with erasure probability $p$. Show that

$$I(X; Y) \le 1 - p.$$

Also derive the condition under which the equality in the above holds.
(b) Suppose the channel is now a BSC with crossover probability $p$. Show that

$$I(X; Y) \le 1 - H(p).$$

Also derive the condition under which the equality in the above holds.
(c) In Section 2.4, we learned that the capacity of a discrete memoryless
channel is

$$C = \max_{p(x)} I(X; Y).$$

Show that $I(X; Y)$ is a concave function in $p(x)$.


Prob 5.2 (Computation of channel capacity) Let $S_1 \sim \mathrm{Bern}(q)$ and
$S_2 \sim \mathrm{Bern}(\tfrac{1}{2})$. Let $X = S_1 \oplus S_2$ be an input to a BSC with crossover
probability $p$ and $Y$ denote an output of the channel. Suppose $S_1$ and $S_2$ are
independent of each other. Compute $I(X; Y)$.


Prob 5.3 (Typical versus non-typical sequence pairs) Consider a
discrete memoryless channel. Suppose $X^n$ are i.i.d., each being generated as per
$p_X(x)$. Let $T^n$ be another i.i.d. random process, being independent of $X^n$ yet
each being generated according to the same $p_X(\cdot)$. Let $Y^n$ be the output of the
channel when $X^n$ is fed into it. Explain the limiting behavior of

$$\lim_{n \to \infty} \frac{1}{n} \log \frac{1}{p_{X^n, Y^n}(T^n, Y^n)}.$$

Prob 5.4 (Jointly typical sequence) Consider a discrete memoryless channel.
Suppose that the encoder uses a random code in which $X_i(w)$’s are i.i.d. $\sim p(x)$
across $i \in \{1, \ldots, n\}$ and $w \in \{1, \ldots, 2^{nR}\}$.

(a) Show that for $w \in \{1, \ldots, 2^{nR}\}$,

$$p(y^n | w) = \prod_{i=1}^{n} p(y_i | x_i(w)).$$

(b) Suppose that $X^n(1)$ is transmitted. Show that

$$\frac{1}{n} \log \frac{1}{p(X^n(1), Y^n)} \xrightarrow{\text{in prob.}} H(X, Y) \quad \text{as } n \to \infty.$$

Also show that for $\epsilon > 0$, as $n \to \infty$,
(c) Suppose that $X^n(1)$ is transmitted. A student claims that as in part (b),

$$\frac{1}{n} \log \frac{1}{p(X^n(2), Y^n)} \xrightarrow{\text{in prob.}} H(X, Y) \quad \text{as } n \to \infty.$$

Prove or disprove this statement.

(d) Show that $Y^n$ is i.i.d.


Prob 5.5 (Fano’s inequality) We wish to transmit a message $W \in \{1, \ldots, 2^{nR}\}$
over a discrete memoryless channel. Let $\hat{W}$ be a decoded message at a receiver. Let
$E = \mathbf{1}\{\hat{W} \neq W\}$ where $\mathbf{1}\{\cdot\}$ denotes an indicator function which returns 1 if the
argument is true; 0 otherwise. Let $P_e := P(\hat{W} \neq W)$. Show that

$$H(W | \hat{W}) = H(E | \hat{W}) + P_e \cdot H(W | \hat{W}, E = 1).$$

Prob 5.6 (Data processing inequality) Suppose a random process $\{X_i\}$
satisfies: for all $i \ge 1$,

$$p(x_{i+2} | x_{i+1}, x_i, \ldots, x_1) = p(x_{i+2} | x_{i+1}, x_i).$$

Let $S_i := (X_{i+1}, X_i)$. A student claims that $I(S_1; S_2) \ge I(S_1; S_3)$. Prove or disprove
the claim.


Prob 5.7 (Source-channel separation theorem) We wish to transmit an
information source of a stationary process $V^k$ over a DMC. Let $X^n$ be the output
of an encoder fed by $V^k$, and $Y^n$ be the output of the DMC when $X^n$ is fed into it.
Let $\hat{V}^k$ be the decoded information source at a receiver. Let $C := \max_{p(x)} I(X; Y)$
and $P_e := P(\hat{V}^k \neq V^k)$.

(a) Using the source coding and channel coding theorems, show that if

$$kH(V) \le nC, \qquad (2.21)$$

then $P_e$ can be made arbitrarily close to 0 as $n \to \infty$. Here $H(V)$ denotes
the entropy rate:

$$H(V) := \lim_{k \to \infty} \frac{H(V^k)}{k}.$$

(b) Show that for an integer $m > 0$,

$$\frac{H(V^m)}{m} \ge \frac{H(V^{m+1})}{m+1}.$$

Hint: $H(V^m) = \sum_{i=1}^{m} H(V_i | V^{i-1})$ and $H(V_i | V^{i-1})$ is a non-increasing
sequence in $i$.

(c) Prove: (i) Fano’s inequality $H(V^k | \hat{V}^k) \le 1 + P_e \cdot k \log |\mathcal{V}|$; and (ii)
data processing inequality $I(V^k; \hat{V}^k) \le I(X^n; Y^n)$. Here $\mathcal{V}$ denotes the
range of $V$.

(d) Using parts (b) and (c), show that if we want to make $P_e \to 0$, then
$kH(V) \le nC$ must hold.

(e) Does the condition (2.21) serve as the sufficient and necessary condition for 
reliable communication, even when using an encoder/decoder that doesn’t 
follow Shannon's two-stage architecture? Based on the answer, can we deter- 
mine the optimality of the two-stage architecture? Is the two-stage architec- 
ture lossless in terms of optimality, meaning that any data rate achieved 
with an arbitrary encoder/decoder can also be achieved using the two-stage 
architecture? 


Prob 5.8 (Source-channel separation theorem) We wish to encode i.i.d.
$V^n \sim$ Bern(4) for transmission over a binary erasure channel with erasure
probability $\epsilon$. Find the necessary and sufficient condition under which the probability
of error $P(\hat{V}^n \neq V^n)$ can be made arbitrarily close to 0 as $n \to \infty$.


Prob 5.9 (Capacity of a composite channel) Suppose that two binary
symmetric channels (BSCs) with crossover probabilities $p_1$ and $p_2$ respectively are
connected end-to-end to form a composite channel. Let $X_1^n$ and $Y_1^n$ (or $X_2^n$ and $Y_2^n$)
indicate the input and output of the first (or second) BSC. Here $n$ denotes the code
length.

(a) Suppose no operation is allowed between the two BSCs, i.e., $X_2^n$ is simply
set as $Y_1^n$. Compute the capacity of this composite channel.

(b) Suppose any operation is allowed between the two BSCs, i.e., $X_2^n$ can now
be an arbitrary function of $Y_1^n$. Compute the capacity of this composite
channel.


Prob 5.10 (Capacity of the union of two channels) Consider a discrete
memoryless channel which is the union of the following two channels: (i) a binary
symmetric channel with crossover probability $p$; and (ii) an erasure channel with
erasure probability $\epsilon$. At each time, one can send a symbol over channel 1 or channel 2
but not both. Find the capacity of this channel.

Prob 5.11 (Capacity of a composite channel) Let $X, Y, Z$ be three discrete
random variables defined on $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ respectively. Suppose that $p(y|x)$ is a
DMC with input in $\mathcal{X}$, output in $\mathcal{Y}$ and capacity $C_1$; and $q(z|y)$ is another DMC
with input in $\mathcal{Y}$, output in $\mathcal{Z}$, and capacity $C_2$. Consider the following DMC

$$r(z|x) = \sum_{y \in \mathcal{Y}} q(z|y) p(y|x)$$

with input in $\mathcal{X}$ and output in $\mathcal{Z}$. A confused information theorist claims that the
capacity of this DMC is $\min(C_1, C_2)$. Either prove or disprove this statement.


Prob 5.12 (Channel capacity and achievability) We wish to transmit a
uniformly distributed message $W \in \{1, \ldots, 2^{nR}\}$ over a memoryless binary erasure
channel with erasure probability $p$. Here $n$ denotes the code length and $R := \frac{\log M}{n}$
where $M$ indicates the cardinality of the range of $W$. Let $f$ and $g$ be channel encoder
and decoder respectively. Let $P_e = P(\hat{W} \neq W)$ be the probability of error where
$\hat{W}$ indicates a decoded message. Let $C$ be the capacity of this channel.


(a) Given R, define: 


PŽ(R): Lee 
For $\epsilon > 0$, compute $P_e^*(C - \epsilon)$ and $P_e^*(C + \epsilon)$.

(b) Show that $C \le 1 - p$.

(c) In Section 2.2, we intended to prove the achievability of $1 - p$. To this
end, we employed a random encoder in which $X_i(w)$’s are generated i.i.d.
according to $p(x)$ for $i \in \{1, 2, \ldots, n\}$ and $w \in \{1, 2, \ldots, 2^{nR}\}$. What was
the choice of $p(x)$?

(d) Given $Y^n = y^n$, derive an optimal decoder. What is the name of the
optimal decoder? Also explain the rationale behind the naming. We say that a
decoder is optimal if it minimizes the probability of error.

(e) It turns out that under the random encoding in part (c) and the optimal
decoder in part (d), $P_e$ can be made arbitrarily close to 0 as $n \to \infty$. Show
that this implies the existence of an optimal deterministic code, say $\mathcal{C}^*$, such
that given $R < 1 - p$, $P_e(\mathcal{C}^*) \to 0$ as $n \to \infty$.


Prob 5.13 (Role of feedback) This problem explores the role of the channel
output feedback in a DMC. We wish to transmit a message $W \in \{1, 2, \ldots, 2^{nR}\}$
over the DMC. Let $X^n$ be a transmitted signal and $Y^n$ be the channel output.
Unlike the conventional non-feedback setting, we assume that the past channel
output $Y^{i-1}$ is available at the encoder at time $i$; hence, $X_i$ is a function of $(W, Y^{i-1})$.
Let $\hat{W}$ be the decoded message.




(a) Show that $(W, Y^n, \hat{W})$ form a Markov chain, i.e., $W - Y^n - \hat{W}$.

(b) Using part (a), show that $I(W; \hat{W}) \le I(W; Y^n)$ (data processing
inequality).

(c) Prove that $H(W | \hat{W}) \le 1 + P_e \cdot nR$ (Fano’s inequality).

(d) Using the definition of memoryless channels and the fact that $X_i$ is a
function of $(W, Y^{i-1})$, show that

$$H(Y^n | W) = \sum_{i=1}^{n} H(Y_i | X_i).$$

(e) Using parts (b), (c), (d), prove that the capacity of this feedback channel is
still $C := \max_{p(x)} I(X; Y)$, meaning that feedback cannot increase capacity.
Since the achievability readily comes from the nonfeedback scheme, you
only need to prove the converse: if $P_e \to 0$, then $R \le C$.


Prob 5.14 (Capacity of a cascaded channel) Consider a cascade of 7 identi- 
cal binary symmetric channels (BSCs), each with crossover probability p € (0, 1). 
Assume that no operation is allowed in between any two BSCs. Compute the capac- 
ity of this cascaded channel. 


Prob 5.15 (Capacity of a cascaded channel) Consider a discrete
memoryless channel which concatenates the following two channels serially: (i) a binary
symmetric channel with crossover probability $p$; and (ii) an erasure channel with
erasure probability $\epsilon$. Let $X$ and $Y$ be the input and output of the channel.


(a) Derive the conditional distribution p(y|x). 
(b) Find the capacity of this channel. 


Prob 5.16 (Markov chain) Suppose

$$X - Y - (Z, W).$$

(a) Show that

$$X - (Y, Z) - W.$$

(b) Find $I(X; W | Y)$.


Prob 5.17 (Capacity of the union of two channels) Define the probability
transition matrix $P$ for a discrete memoryless channel (DMC) as a matrix whose
$(x, y)$ entry is the probability that the output of the channel is $y$ given the input
$x$. Consider two DMCs, called DMC$_1$ and DMC$_2$, with transition matrices $P_1$
and $P_2$ respectively. Consider a third DMC, say DMC$_3$, whose transition matrix is
given by

$$\begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix}.$$

Define a selector $S$ that selects the channel to which a symbol is transmitted.

(a) Show that

$$I(X; Y) = I(X, S; Y).$$

(b) Show that the capacity of DMC$_3$ is given by

$$C_3 = \log(2^{C_1} + 2^{C_2}),$$

where $C_i$ is the capacity of DMC$_i$.
Hint: Use part (a).

Prob 5.18 (Capacity of an erasure channel with two erasures) Consider
a DMC with binary input $X \in \{0, 1\}$, output $Y \in \{0, e_0, e_1, 1\}$, and channel
probabilities:

$$
\begin{aligned}
p(0|0) &= p(1|1) = 1 - p - q, \\
p(e_0|0) &= p(e_1|1) = p, \\
p(e_1|0) &= p(e_0|1) = q,
\end{aligned}
$$

where $p > q > 0$. Find the capacity of this DMC.


Prob 5.19 (Quantum channel) Alice wishes to transmit a single bit X ~ 
Bern(5) to Bob. There is an intruder Eve who intends to interfere with the commu- 
nication. With probability 1 — p, Eve does nothing, so Bob receives X. With prob- 
ability p, however, Eve intervenes in the communication between Alice and Bob. 
With probability p, Eve intercepts the transmitted bit in a possibly noisy manner. 
The intercepted bit Z is: 


X, w.p. 
Z= 
X+N, wp. 


NI= NI = 


where N ~ Bern(4), independent of X. Eve then resends Z to Bob. Hence, a 
received signal Y at Bob reads: 


$$Y = \begin{cases} X, & \text{w.p. } 1 - p; \\ Z, & \text{w.p. } p. \end{cases}$$

(a) Eve wishes to decode $X$ given its received signal. Remember that Eve gets
nothing w.p. $1 - p$ and gets $Z$ w.p. $p$. Let $\hat{X}_E$ be the optimal detector output
w.r.t. $X$. Compute $P(\hat{X}_E \neq X)$. Also compute $I(X; \hat{X}_E)$.

(b) Similarly Bob wants to decode $X$ given $Y$. Let $\hat{X}_B$ be the corresponding
optimal detector output. Compute $P(\hat{X}_B \neq X)$. Compute $I(X; \hat{X}_B)$.

(c) Find a condition on $P(\hat{X}_E \neq X)$ such that $I(X; \hat{X}_B) > I(X; \hat{X}_E)$.


Prob 5.20 (True or False?) 


(a) Consider a discrete memoryless channel with input $X_i$ and output $Y_i$. Then,
$H(Y_1, Y_2 | X_1, X_2) = H(Y_1 | X_1) + H(Y_2 | X_2)$.


(b) Consider a channel coding problem setup where the channel is a BEC with 
erasure probability 0.11. A hard-working student claims that she can come 
up with a transmission scheme that achieves 100 bits of reliable transmission 
with code length 1000. Does the claim make sense? 

(c) Let $V_1, V_2, \ldots, V_k$ be a finite alphabet i.i.d. source. The information source
is encoded as a sequence of $n$ input symbols $X^n$ of a memoryless binary
erasure channel. Let $Y^n$ be the output of the channel. Given $Y^n = y^n$, we
wish to decode $X^n$. The optimal decoder is:

$$\hat{x}^n = \arg\max_{x^n} P(Y^n = y^n | X^n = x^n).$$


(d) Consider a discrete memoryless channel (DMC) with probability transition 
probability: 


plx) = 
0 0 l-@q q 


where rows and columns correspond to values of $x$ and $y$, respectively. The
capacity of this channel is $C = 2 - H(p) - H(q)$ where
$H(p) := p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p}$.

(e) Consider an additive channel whose input alphabet $\mathcal{X} = \{0, \pm 2, \pm 4\}$ and
whose output $Y = X + Z$. Here $Z$ is distributed uniformly over the interval
$[-1, 1]$. The capacity of this channel is $\log 3$.

(f) Consider a DMC with feedback in which the past channel output is
available at encoder: an encoded signal $X_i$ at time $i$ is a function of $(W, Y^{i-1})$.
Here $W$ indicates a message $W \in \{1, 2, \ldots, 2^{nR}\}$, $Y_i$ denotes channel
output at time $i$, and $Y^{i-1} := (Y_1, \ldots, Y_{i-1})$. Then, $W - X^n - Y^n$.

(g) Suppose a random process $\{X_i\}$ satisfies: for all $i \ge 1$,

$$p(x_{i+2} | x_{i+1}, x_i, x_{i-1}, \ldots, x_1) = p(x_{i+2} | x_{i+1}, x_i).$$

Let $S_i := (X_{i+1}, X_i)$. Then, $I(S_1; S_3) \ge I(S_2; S_4)$.

(h) Suppose that two binary symmetric channels with crossover probabilities $p_1$
and $p_2$ respectively are connected end-to-end to form a composite channel.
Suppose that any operation is allowed between the two channels. Then, the
capacity of the composite channel is $1 - \max\{H(p_1), H(p_2)\}$.

(i) Suppose random variables $(X_1, X_2, X_3, X_4, X_5)$ form a Markov chain. Then,
I(X; X) > IX; X5).

(j) Shannon created a communication architecture that separates source
coding and channel coding, allowing them to be performed independently
of each other. This independent design facilitates the standardization of
the digital interface and simplifies implementation. However, this comes
at the expense of performance degradation resulting from the separation of
the architecture.

(k) Let $X, Y, Z$ be three discrete random variables defined on $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$
respectively. Suppose that $p(y|x)$ is a discrete memoryless channel (DMC) with
input in $\mathcal{X}$, output in $\mathcal{Y}$ and capacity $C_1$; and $q(z|y)$ is another DMC with
input in $\mathcal{Y}$, output in $\mathcal{Z}$, and capacity $C_2$. Consider the following DMC

$$r(z|x) = \sum_{y \in \mathcal{Y}} q(z|y) p(y|x)$$

with input in $\mathcal{X}$ and output in $\mathcal{Z}$. This DMC can have capacity greater
than $\max(C_1, C_2)$.

(l) Let $Y_1$ and $Y_2$ be conditionally independent and conditionally identically
distributed given $X$. The capacity of a channel with input $X$ and output
$(Y_1, Y_2)$ is twice the capacity of a channel with input $X$ and output $Y_1$.

2.7 Polar Code: Polarization


Recap In Part II, we proved the channel coding theorem and source-channel sep- 
aration theorem for discrete memoryless channels. To achieve this, we used a ran- 
dom coding argument in the proof of the channel coding theorem to establish the 
existence of an optimal code. However, we did not discuss how to construct an 
explicit and practical code that guarantees optimality. 


Outline In the remainder of Part II, we will delve into a specific deterministic 
code called the polar code, which achieves channel capacity for a class of channels. 
This section is divided into four parts. Firstly, we will share an interesting backstory 
about the random code utilized in the achievability proof of the channel coding 
theorem. Following that, we will discuss two major endeavors focused on explicit 
code constructions, with emphasis on the polar code. Then, we will highlight a 
crucial feature of the polar code known as polarization, which brings about a fas- 
cinating phenomenon. Finally, we will examine the polarization-inspired encoding 
and decoding methods to describe how the polar code operates. 


Initial reactions to Shannon’s channel coding theorem When Shannon
first introduced his channel coding theorem, it was met with mixed reactions, par- 
ticularly from communication systems engineers. There were three main reasons 
for this. Firstly, many engineers did not comprehend the concepts of reliable com- 
munication, achievable data rate, and capacity, which made it difficult for them to 
understand the theorem. Secondly, even those who understood the theorem had a 
negative outlook on the development of an optimal code. The achievability proof 
of the theorem suggested that the optimal code required a lengthy code-length to 
attain capacity, which seemed complex and unfeasible given the technology of the 
time. Lastly, even the optimistic engineers who saw the potential for implementa- 
tion with future technology were not confident because Shannon did not provide 
a concrete method for designing an optimal code. The proof only established the 
existence of optimal codes without specifying how to construct them. 


Two major efforts Due to these reasons, Shannon and his supporters, including 
some intelligent MIT folks, made significant efforts to develop explicit and deter- 
ministic optimal codes with potentially low implementation complexity. Unfortu- 
nately, Shannon himself failed to develop a good deterministic code. Instead, his 
MIT supporters came up with some successful codes, and in this section, we will 
discuss two of their major efforts. 

One of the major efforts was made by Robert G. Gallager, who developed the
“Low Density Parity Check code” (LDPC code for short) in 1960 (Gallager, 1962).
It is an explicit and deterministic code that comes with a detailed guideline on how
to design it. The code’s performance is remarkable: it approaches capacity as the
code length tends to infinity, although it does not match the capacity precisely.
However, the LDPC code was not initially given enough credit, since its
implementation complexity was still too high for the digital signal processing (DSP)
technology of the day.² The code was revived about 30 years later, once DSP
technology had evolved enough to implement it efficiently. Currently, it is widely
employed in a variety of systems, such as LTE, WiFi, DMB,³ and storage systems.

Gallager was not entirely satisfied with his result, as his code was not guaranteed 
to achieve capacity precisely, even in the limit of code length. This motivated one 
of his PhD students, Erdal Arikan, to develop a capacity-achieving deterministic 
code. Arikan developed the first capacity-achieving deterministic code, called polar 
code (Arikan, 2009). Interestingly, he developed the code in 2007, more than 30 years
after the work of Gallager that motivated it. Due to its excellent performance and
low-complexity nature, the polar code is being seriously considered for implemen- 
tation in a variety of systems. We will dedicate the next three sections to study the 
polar code. 


The encoder structure that Arikan imagined Arikan focused on a simple
channel whose input is binary-valued. Examples include the BEC and the BSC
that we studied earlier. He proceeded to examine the statistical characteristic
that the channel input $X^n$ must possess to achieve the capacity. This brought
to mind the random code utilized by Shannon, where $X^n$ is independent and
identically distributed (i.i.d.). This served as a source of inspiration for Arikan,
leading to the development of the encoding structure shown in Fig. 2.5.


Figure 2.5. The encoder structure of the polar code. (Figure labels: i.i.d. information
bits and $(n - nR)$ i.i.d. random dummy bits enter a full-rank matrix, which outputs
the i.i.d. sequence $X^n$.)


2. Despite the lack of immediate impact on the world, his remarkable work earned him a position as a faculty 
member at MIT right after graduation. It is fortunate that there are scholars who are patient enough to 
recognize the potential of such groundbreaking work. 


3. It is the name of the technology for a digital broadcast system, standing for Digital Multimedia Broadcasting.

To explain the structure in detail, we first introduce some notation. We denote
the message $W$ by the binary string $U^{nR} := (U_1, U_2, \ldots, U_{nR})^T$ where $T$ denotes
a transpose. Note that the message can be represented as $nR$ i.i.d. bits. Why? We
intend to generate another sequence $X^n$ from $U^{nR}$. Remember that we consider
a binary-input channel where the capacity cannot exceed 1 (why?). Hence, $X^n$ is
longer than $U^{nR}$. We introduce $(n - nR)$ additional bits that we call dummy bits.
Since we want $X^n$ to be an i.i.d. sequence, we choose random dummy bits that are
i.i.d. and independent of $U^{nR}$.

The construction of such $X^n$ takes two steps. First, we combine $U^{nR}$ and
the dummy bits to generate a length-$n$ sequence, say $V^n$. Here $V_i$ takes either a
component of $U^{nR}$ or one of the dummy bits. For instance, when $n = 6$ and
$R = \frac{1}{2}$, we may have $V^n = (U_1, U_2, \text{dummy}_1, \text{dummy}_2, U_3, \text{dummy}_3)^T$.
Whether $V_i$ is taken from $U^{nR}$ or from the dummy bits depends on a
particular rule that will be specified later on. We then pass $V^n$ through a full-rank
matrix of size $n$-by-$n$, say $G_n$, yielding:

$$X^n = G_n V^n.$$


One can verify that $X^n$ is also i.i.d. (the property that we wished to obtain) as long
as $G_n$ has full rank; check this in Prob 6.3. It will become clear soon why the
full-rank matrix is employed here.
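The role of the full-rank condition can be checked by brute force for small $n$. The sketch below is only an illustration: the two matrices and the helper names `encode` and `count_distinct_outputs` are assumptions, not taken from the text. It counts how many distinct outputs $x = Gv$ (arithmetic modulo 2) arise as $v$ ranges over all binary vectors. A full-rank matrix gives a one-to-one mapping, so uniform i.i.d. bits in $V^n$ stay uniform i.i.d. in $X^n$; a rank-deficient matrix does not.

```python
from itertools import product

def encode(G, v):
    """Return x = G v over GF(2): each x_i is a modulo-2 inner product."""
    return tuple(sum(g * b for g, b in zip(row, v)) % 2 for row in G)

def count_distinct_outputs(G):
    """Number of distinct x = G v as v ranges over all binary vectors."""
    n = len(G)
    return len({encode(G, v) for v in product((0, 1), repeat=n)})

# Illustrative matrices (assumptions chosen for this sketch).
G_full = [[1, 1, 0],
          [0, 1, 1],
          [0, 0, 1]]       # upper triangular with unit diagonal: full rank
G_deficient = [[1, 1, 0],
               [1, 1, 0],
               [0, 0, 1]]  # two identical rows: rank 2

print(count_distinct_outputs(G_full))       # 8: every x is hit exactly once
print(count_distinct_outputs(G_deficient))  # 4: x cannot be uniform on 8 values
```

With the rank-deficient matrix the first two output bits always coincide, so they are not independent.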

This is the encoder structure that Arikan imagined. Under this structure, he
observed that for some particular $G_n$, an interesting phenomenon occurs. He
called that phenomenon “polarization”; the rationale behind the naming will become
clear later. In fact, he discovered the phenomenon in the process of manipulating
the following quantity: $I(V^n; Y^n)$.


Polarization He made two observations on $I(V^n; Y^n)$. The first is:

$$\text{Observation \#1:} \quad I(V^n; Y^n) = n I(X; Y) =: nI \tag{2.22}$$

where $X$ (or $Y$) indicates a generic random variable for $X_i$ (or $Y_i$). We denote
$I(X; Y)$ by $I$ for notational simplicity. The proof of Observation #1 is as follows:

$$\begin{aligned}
I(V^n; Y^n) &= H(Y^n) - H(Y^n | V^n) \\
&\overset{(a)}{=} H(Y^n) - H(Y^n | X^n, V^n) \\
&\overset{(b)}{=} H(Y^n) - H(Y^n | X^n) \\
&\overset{(c)}{=} H(Y^n) - \sum_{i=1}^{n} H(Y_i | X_i) \\
&\overset{(d)}{=} \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i | X_i) \\
&= n I(X; Y) = nI
\end{aligned}$$

where (a) follows from the fact that $X^n$ is a function of $V^n$; (b) is due to the Markov
chain of $V^n - X^n - Y^n$ (why?); (c) comes from the memoryless property of the
channel; and (d) follows from the fact that $Y^n$ is i.i.d. (why?).

The second observation is:

$$\text{Observation \#2:} \quad I(V^n; Y^n) = \sum_{i=1}^{n} I(V_i; Y^n, V^{i-1}). \tag{2.23}$$

This is due to the chain rule w.r.t. mutual information and the fact that $V_i$ is
independent of $V^{i-1}$: $I(V_i; Y^n | V^{i-1}) = I(V_i; Y^n, V^{i-1})$. If you are not convinced,
please check in Prob 6.2. Arikan viewed $I(V_i; Y^n, V^{i-1})$ as a quantity that indicates
the data rate w.r.t. the $i$th virtual subchannel (say $p_i$) with input $V_i$ and output
$(Y^n, V^{i-1})$:

$$V_i \longrightarrow \text{(virtual subchannel } i\text{)} \longrightarrow (Y^n, V^{i-1}). \tag{2.24}$$


He then observed an interesting phenomenon for the quantity $I(V_i; Y^n, V^{i-1})$ under
some particular choice of $G_n$. To illustrate this, let us plot an empirical CDF
(cumulative distribution function) of the quantity:

$$\frac{|\{i : I(V_i; Y^n, V^{i-1}) \le x\}|}{n}$$

as a function of a dummy variable $x$.

You will soon understand why we plot the empirical CDF to see the interesting
phenomenon. To understand this, let us first consider the case of $n = 1$. In this
case, we have only one virtual subchannel where the data rate is $I(V_1; Y_1)$, which
is $I$ due to (2.22). So the empirical CDF is a unit-step function jumping at $I$. See
the first subfigure in Fig. 2.6.

When $n = 2$, the summation includes:

$$\begin{aligned}
I_1 &:= I(V_1; Y_1, Y_2); \\
I_2 &:= I(V_2; Y_1, Y_2, V_1).
\end{aligned} \tag{2.25}$$

Due to Observations #1 and #2 in (2.22) and (2.23):

$$I_1 + I_2 = 2I. \tag{2.26}$$

Figure 2.6. Polarization: $I(V_i; Y^n, V^{i-1})$ takes either 1 or 0 in the limit of $n$.


Suppose that $I_1 < I < I_2$ under some $G_2$. Then, we would have two jumps at $I_1$
and $I_2$ as plotted in the second subfigure in Fig. 2.6. What Arikan found is that
under some $G_n$ the empirical CDF is of a very interesting shape in the limit of $n$:
the jumps occur only at two extreme points, which are 0 and 1. This implies:

$$I(V_i; Y^n, V^{i-1}) \longrightarrow 1 \text{ or } 0 \quad \text{as } n \to \infty, \tag{2.27}$$

meaning that the data rate of the $i$th subchannel is polarized (perfect or
completely noisy). Also (2.22) together with (2.23) suggests that the fraction of
subchannels with $I(V_i; Y^n, V^{i-1})$ being 1 (or 0) approaches $I$ (or $1 - I$):

$$\begin{aligned}
\frac{|\{i : I(V_i; Y^n, V^{i-1}) \approx 1\}|}{n} &\longrightarrow I; \\
\frac{|\{i : I(V_i; Y^n, V^{i-1}) \approx 0\}|}{n} &\longrightarrow 1 - I.
\end{aligned} \tag{2.28}$$


Encoding & decoding The polarization reflected in (2.27) and (2.28) immediately
suggests the following encoding rule:

$$\text{Set } V_i = \begin{cases}
\text{information bit (from } U^{nR}\text{)}, & \text{if } I(V_i; Y^n, V^{i-1}) \approx 1; \\
\text{dummy bit}, & \text{otherwise},
\end{cases}$$

$$\text{Set } R \approx I.$$

You may wonder when $I(V_i; Y^n, V^{i-1})$ is close to 1 or 0. It depends on the structure
of $G_n$, which we will investigate in detail later on.
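The encoding rule can be sketched in a few lines. In this minimal illustration the data rates in `rates` are hypothetical numbers (how to obtain the real ones depends on $G_n$), and the helper name `place_bits` is an assumption. At finite $n$ the rates are not exactly 0 or 1, so the sketch simply gives the information bits to the subchannels with the largest rates.

```python
def place_bits(rates, info_bits, dummy_bits):
    """Build V^n: information bits on the best subchannels, dummy bits elsewhere."""
    n = len(rates)
    num_info = len(info_bits)
    assert num_info + len(dummy_bits) == n
    # Indices of the num_info subchannels with the largest data rates.
    ranked = sorted(range(n), key=lambda i: rates[i], reverse=True)
    best = set(ranked[:num_info])
    info_iter, dummy_iter = iter(info_bits), iter(dummy_bits)
    return [next(info_iter) if i in best else next(dummy_iter) for i in range(n)]

# Hypothetical data rates for n = 6 subchannels (made up for illustration).
rates = [0.97, 0.95, 0.03, 0.05, 0.99, 0.01]
v = place_bits(rates, ["U1", "U2", "U3"], ["d1", "d2", "d3"])
print(v)  # ['U1', 'U2', 'd1', 'd2', 'U3', 'd3']
```

With these made-up rates the result matches the $n = 6$, $R = \frac{1}{2}$ arrangement used as an example earlier in this section.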

Now what about decoding? As we learned earlier, the optimal decoding rule
is ML: choosing $v^n$ such that the corresponding likelihood $p(y^n | V^n = v^n)$ is
maximized. But we are not going to employ the ML rule since its complexity
is prohibitive. It requires an exhaustive search over all possible choices of
$U^{nR}$ (why?), i.e., the complexity scales like $2^{nR}$. Instead we will use a suboptimal
yet intuitive and low-complexity rule, the so-called successive cancellation decoding.
This is inspired by the subchannel representation illustrated in Fig. 2.7.

Figure 2.7. Successive cancellation decoding.


The blue-colored (or red-colored) virtual subchannels indicate the perfect (or
completely noisy) channels. Note in the first subchannel, say $p_1$, that the output
contains only the received signal $Y^n$. So one can decode $V_1$ by employing the
following ML rule associated with that particular subchannel:

$$\begin{aligned}
p_1(y^n | V_1 = 1) \ge p_1(y^n | V_1 = 0) &\longrightarrow \hat{V}_1 = 1; \\
p_1(y^n | V_1 = 1) < p_1(y^n | V_1 = 0) &\longrightarrow \hat{V}_1 = 0.
\end{aligned}$$

On the other hand, in the second subchannel $p_2$, the output contains $V_1$ (in
addition to $Y^n$), which is not available at the decoder. But the good news is that the
estimate of $V_1$ is available instead once we perform the above operation regarding
$p_1$. This suggests a successive way of decoding. We first decode $V_1$; we then use this
to decode $V_2$; and so on, all the way up to decoding $V_n$ with $\hat{V}^{n-1}$. To be specific,
for the $i$th subchannel, the decoding rule is:

$$\begin{aligned}
p_i(y^n, \hat{v}^{i-1} | V_i = 1) \ge p_i(y^n, \hat{v}^{i-1} | V_i = 0) &\longrightarrow \hat{V}_i = 1; \\
p_i(y^n, \hat{v}^{i-1} | V_i = 1) < p_i(y^n, \hat{v}^{i-1} | V_i = 0) &\longrightarrow \hat{V}_i = 0
\end{aligned}$$

where the estimates $\hat{v}^{i-1}$ are available from the earlier steps. For time indices $i$ in
which dummy bits are transmitted, we do not need to decode such bits. Hence
we simply skip them, using their values instead to decode the other information bits.
It turns out that this suboptimal (yet intuitive) decoding rule enables the error
probability to be arbitrarily close to 0 as long as $R < I$. The proof of this is not
that simple, so we omit the detailed proof in this book.


Look ahead What follows are two questions that need to be addressed. The first
question is: How can we create $G_n$ to induce polarization? The second question is:
How can we calculate the likelihood $p_i(y^n, \hat{v}^{i-1} | v_i)$ needed for successive
cancellation decoding? These questions will be answered in the next section.


2.8 Polar Code: Implementation of Polarization


Recap In the previous section, we began exploring the polar code, which is the
first code to achieve capacity with low complexity and explicit construction. The
encoder structure of the code consists of two stages. In the first stage, $nR$
information bits ($U^{nR}$) are combined with $n - nR$ dummy bits according to a certain rule
to construct a longer i.i.d. sequence $V^n$. In the second stage, $V^n$ is converted into
$X^n$ by multiplying it with a full-rank matrix $G_n$ to achieve polarization.

Arikan observed that when converting $n$ independent copies of a channel $p(y|x)$
into $n$ virtual subchannels (each having input $V_i$ and output $(Y^n, V^{i-1})$) under a
particular choice of $G_n$, the data rate of each virtual subchannel $I(V_i; Y^n, V^{i-1})$
takes either 1 or 0 in the limit of $n$, indicating complete polarization of the
subchannels. The polarization phenomenon led naturally to the encoding and decoding
strategies. The encoding rule assigns information bits to good channels (with a
data rate of 1) and dummy bits to bad channels (with a data rate of 0). The decoding
rule follows successive cancellation decoding, where $V_i$ is decoded in a step-by-step
manner from $i = 1$ to $i = n$ with the aid of previously decoded bits $\hat{V}^{i-1}$.


Outline This section will cover two topics that were not previously discussed.
Firstly, we will investigate the construction of $G_n$, which is necessary to achieve
polarization. Secondly, we will delve into the computation of the likelihood
functions $p_i(y^n, v^{i-1} | v_i)$, which are required for successive cancellation decoding of the
virtual subchannels.


Case of n = 2: Choice of $G_n$ & likelihood computation Let us start with the
simplest case of $n = 2$. In this case, one obvious yet non-trivial choice for $G_2$ is:

$$G_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}. \tag{2.29}$$

Figure 2.8. The encoder structure of the polar code.

Figure 2.9. (Left): Two independent channels and the mapping between $(V_1, V_2)$ and
$(X_1, X_2)$; (Right): Two converted virtual subchannels (denoted by $p_1$ and $p_2$ respectively).


Clearly, no polarization occurs when $G_2$ is either

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

One can readily verify that in this case, the data rate of the first virtual subchannel
$I_1 := I(V_1; Y^2)$ coincides with that of the second subchannel $I_2 := I(V_2; Y^2, V_1)$,
meaning no polarization between $I_1$ and $I_2$ (please check). The choice of (2.29) is
one of the two remaining non-trivial candidates. This choice yields: $X_1 = V_1 \oplus V_2$
and $X_2 = V_2$; see the left in Fig. 2.9. In the previous section, we converted these
two independent copies of $p(y|x)$ into two virtual subchannels, inspired by the
following formula:

$$I(V^2; Y^2) = I(V_1; Y^2) + I(V_2; Y^2, V_1).$$

See the right in Fig. 2.9 for illustration of each virtual subchannel, say $p_i$ where
$i \in \{1, 2\}$.
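The decomposition above can be checked numerically. The following sketch assumes a binary symmetric channel with crossover probability 0.11 (an arbitrary choice for illustration) and the mapping $X_1 = V_1 \oplus V_2$, $X_2 = V_2$; the helper names are hypothetical. It builds the joint distribution of $(V_1, V_2, Y_1, Y_2)$ and computes each mutual information from entropies.

```python
from itertools import product
from math import log2

eps = 0.11  # assumed BSC crossover probability (illustration only)
W = [[1 - eps, eps], [eps, 1 - eps]]  # W[x][y] = p(y|x)

# Joint pmf of (v1, v2, y1, y2) with uniform i.i.d. V1, V2 and X = G_2 V.
joint = {}
for v1, v2, y1, y2 in product((0, 1), repeat=4):
    x1, x2 = v1 ^ v2, v2
    joint[(v1, v2, y1, y2)] = 0.25 * W[x1][y1] * W[x2][y2]

def marginal(keep):
    """Marginal pmf over the coordinates listed in keep."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

def entropy(keep):
    """Joint entropy (in bits) of the coordinates listed in keep."""
    return -sum(p * log2(p) for p in marginal(keep).values() if p > 0)

def mutual_info(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B) for coordinate index lists a and b."""
    return entropy(a) + entropy(b) - entropy(a + b)

V1, V2, Y1, Y2 = 0, 1, 2, 3  # coordinate positions inside each key
I1 = mutual_info([V1], [Y1, Y2])        # data rate of p_1
I2 = mutual_info([V2], [Y1, Y2, V1])    # data rate of p_2
I_total = mutual_info([V1, V2], [Y1, Y2])
I_single = 1 + eps * log2(eps) + (1 - eps) * log2(1 - eps)  # capacity of the BSC

print(I1, I_single, I2)
print(I1 + I2, I_total, 2 * I_single)
```

The three numbers on the second printed line should agree up to floating-point error, and the first line should show $I_1 < I < I_2$, i.e., partial polarization.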

Now recall one of the two questions raised earlier: how to compute $(p_1, p_2)$,
which are required to perform successive cancellation decoding? Here the key is to
represent $p_i$ in terms of the one that is known: the conditional distribution of the
DMC $p(y|x)$. First we get:

$$\begin{aligned}
p_1(y_1, y_2 | v_1) &\overset{(a)}{=} \sum_{v_2 \in \{0,1\}} p(y_1, y_2, v_2 | v_1) \\
&\overset{(b)}{=} \sum_{v_2 \in \{0,1\}} p(v_2 | v_1)\, p(y_1, y_2 | v_1, v_2) \\
&\overset{(c)}{=} \frac{1}{2} \sum_{v_2 \in \{0,1\}} p(y_2 | v_1, v_2)\, p(y_1 | v_1, v_2, y_2) \\
&\overset{(d)}{=} \frac{1}{2} \sum_{v_2 \in \{0,1\}} p(y_2 | v_2)\, p(y_1 | v_1, v_2, y_2) \\
&\overset{(e)}{=} \frac{1}{2} \sum_{v_2 \in \{0,1\}} p(y_2 | v_2)\, p(y_1 | v_1 \oplus v_2)
\end{aligned}$$

where (a) follows from the total probability law; (b) is due to the definition of
conditional distribution; (c) comes from the definition of conditional distribution
and the fact that $p(v_2 | v_1) = \frac{1}{2}$; (d) follows from the Markov chain:
$V_1 - X_2 (= V_2) - Y_2$ (why?); and (e) is due to the Markov chain:
$(V_1, V_2, Y_2) - V_1 \oplus V_2 - Y_1$ (why?). Notice that $p_1(y_1, y_2 | v_1)$ is represented
in terms of $p(y_2 | v_2)$ and $p(y_1 | v_1 \oplus v_2)$, which are known and therefore can be
computed. Similarly we get:

$$\begin{aligned}
p_2(y_1, y_2, v_1 | v_2) &= p(v_1 | v_2)\, p(y_1, y_2 | v_1, v_2) \\
&= \frac{1}{2}\, p(y_2 | v_2)\, p(y_1 | v_1 \oplus v_2).
\end{aligned}$$
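The two likelihood formulas translate directly into code. The sketch below assumes a binary symmetric channel with crossover probability 0.1 as the underlying DMC (an illustrative choice), stored as a table `W[x][y]`; the function names `p1`, `p2` and `sc_decode` are hypothetical. The decoder follows the successive cancellation rule from Section 2.7 for $n = 2$, with ties resolved in favor of 1.

```python
def p1(W, y1, y2, v1):
    """p_1(y1, y2 | v1) = 1/2 * sum over v2 of W(y2|v2) * W(y1|v1 xor v2)."""
    return 0.5 * sum(W[v2][y2] * W[v1 ^ v2][y1] for v2 in (0, 1))

def p2(W, y1, y2, v1, v2):
    """p_2(y1, y2, v1 | v2) = 1/2 * W(y2|v2) * W(y1|v1 xor v2)."""
    return 0.5 * W[v2][y2] * W[v1 ^ v2][y1]

def sc_decode(W, y1, y2):
    """Successive cancellation for n = 2: decode V1 first, then V2 given the estimate."""
    v1_hat = 1 if p1(W, y1, y2, 1) >= p1(W, y1, y2, 0) else 0
    v2_hat = 1 if p2(W, y1, y2, v1_hat, 1) >= p2(W, y1, y2, v1_hat, 0) else 0
    return v1_hat, v2_hat

# Assumed example channel: BSC with crossover probability 0.1, W[x][y] = p(y|x).
eps = 0.1
W = [[1 - eps, eps], [eps, 1 - eps]]

# Noiseless reception of (V1, V2) = (0, 1): X1 = V1 xor V2 = 1 and X2 = V2 = 1.
print(sc_decode(W, 1, 1))  # (0, 1)
```

When no bit is flipped by the channel, the decoder recovers every input pair; the interesting behavior (and the difference in reliability between $p_1$ and $p_2$) appears once channel errors are simulated.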


Example: Binary Erasure Channel To give you a concrete feel as to how to
compute $p_1$ and $p_2$ derived above, let us consider the example of a binary erasure
channel. Let $\alpha$ denote the erasure probability of the BEC. Note in the BEC that

$$(Y_1, Y_2) = \begin{cases}
(V_1 \oplus V_2, V_2), & \text{w.p. } (1 - \alpha)^2; \\
(e, V_2), & \text{w.p. } \alpha(1 - \alpha); \\
(V_1 \oplus V_2, e), & \text{w.p. } (1 - \alpha)\alpha; \\
(e, e), & \text{w.p. } \alpha^2.
\end{cases}$$

One can decode $V_1$ only when there is no erasure (the first case). Hence, one can
view $p_1$ as another BEC yet having a different erasure probability: $1 - (1 - \alpha)^2$. Since
$I_1 = (1 - \alpha)^2$ is smaller than the capacity of the original BEC ($C_{\mathsf{BEC}} = 1 - \alpha$),
$p_1$ is sort of a bad channel.


On the other hand,

$$(Y_1, Y_2, V_1) = \begin{cases}
(V_1 \oplus V_2, V_2, V_1), & \text{w.p. } (1 - \alpha)^2; \\
(e, V_2, V_1), & \text{w.p. } \alpha(1 - \alpha); \\
(V_1 \oplus V_2, e, V_1), & \text{w.p. } (1 - \alpha)\alpha; \\
(e, e, V_1), & \text{w.p. } \alpha^2.
\end{cases}$$

One can decode $V_2$ except for one case in which both channels are erased (the last
case); hence, $p_2$ can be viewed as another BEC with erasure probability $\alpha^2$. Since
$I_2 = 1 - \alpha^2$ is larger than $C_{\mathsf{BEC}} = 1 - \alpha$, $p_2$ is sort of a good channel.

Figure 2.10. Channel splitting for arbitrary n.


The above two interpretations suggest that the two independent copies of p(y|x) 
can be split into two virtual subchannels: one with a smaller capacity (the bad 
channel); the other with a larger capacity (the good channel). So we can view this 
split as partial polarization. 
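The two erasure probabilities can also be confirmed by a quick Monte Carlo experiment. The sketch below is an illustration under assumed parameters ($\alpha = 0.4$, a fixed random seed, and the hypothetical helper `simulate_split`). It only tracks which outputs are erased: $V_1$ is recoverable when neither output is erased, and $V_2$ (given $V_1$) is recoverable unless both are erased.

```python
import random

def simulate_split(alpha, trials=100_000, seed=0):
    """Estimate how often V1 (via p_1) and V2 (via p_2) can be recovered over a BEC."""
    rng = random.Random(seed)
    ok1 = ok2 = 0
    for _ in range(trials):
        e1 = rng.random() < alpha  # is Y1 erased?
        e2 = rng.random() < alpha  # is Y2 erased?
        # V1 = Y1 xor Y2 requires both outputs to survive.
        ok1 += (not e1) and (not e2)
        # V2 = Y2 if Y2 survives; otherwise V2 = Y1 xor V1 if Y1 survives.
        ok2 += not (e1 and e2)
    return ok1 / trials, ok2 / trials

alpha = 0.4
rate_bad, rate_good = simulate_split(alpha)
print(rate_bad, (1 - alpha) ** 2)   # empirical vs. I_1 = (1 - alpha)^2
print(rate_good, 1 - alpha ** 2)    # empirical vs. I_2 = 1 - alpha^2
```

The empirical fractions should land close to $(1 - \alpha)^2 = 0.36$ and $1 - \alpha^2 = 0.84$, one below and one above the original capacity $1 - \alpha = 0.6$.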


Idea: Channel splitting for an arbitrary n This observation leads to a natural
way of polarizing $n$ independent copies of $p(y|x)$. The idea is to repeat the same
for the two split subchannels. See Fig. 2.10.

We first split the original BEC (with erasure probability $\alpha$) into two subchannels,
say $p^-$ and $p^+$ to indicate the bad and good subchannels respectively. Then, the
erasure probability of the bad (or good) channel would be $\alpha^- := 1 - (1 - \alpha)^2$
(or $\alpha^+ := \alpha^2$). Similarly we split $p^-$ into another set of two split subchannels, say
$p^{--}$ and $p^{-+}$ where $\alpha^{--} := 1 - (1 - \alpha^-)^2$ and $\alpha^{-+} := (\alpha^-)^2$. To this end, we
should first construct two $p^-$'s from four $p$'s. Similarly $p^+$ is split into $(p^{+-}, p^{++})$.
We repeat this until we get $n$ split subchannels. This way, one may imagine that
the subchannels are completely polarized: the data rate of a split subchannel in
the final stage is either 1 or 0 in the limit of $n$. It turns out this is indeed the
case.


How to implement the idea? Before proving the complete polarization of the
subchannels, let us explore how to implement the channel-splitting idea, i.e., how
to construct $G_n$ that yields the splitting? Consider a case in which $n$ is of the form
$2^k$. Let us start with $n = 2^2$. First of all, we merge the first and second copies of
$p(y|x)$ to construct $(p^-, p^+)$, also merging the third and fourth copies to construct
another $(p^-, p^+)$. We then combine the two $p^-$'s to construct $(p^{--}, p^{-+})$. The
way to construct is the same as before: adding two inputs of $(p^-, p^-)$ to yield the
first output while simply passing the second input to yield the second output. See
the modulo addition on top in the left of Fig. 2.11. We add $V_1$ and $V_3$ (inputs
of $p^-$ subchannels) while simply passing $V_3$. We do the same for the remaining
$p^+$'s, thus yielding $(p^{+-}, p^{++})$. Now remember how we constructed outputs of
polarized virtual subchannels. The output of $p^{--}$ should read a collection of the
outputs of the $p^-$'s: $Y^4 = (Y_1, Y_2, Y_3, Y_4)$; and the output of $p^{-+}$ should read a
collection of $Y^4$ and the input $V_1$ of the first $p^-$. Similarly the outputs of $p^{+-}$ and
$p^{++}$ should be: $(Y^4, V_1, V_3)$ and $(Y^4, V_1, V_3, V_2)$ respectively.

Figure 2.11. Channel splitting for n = 4.

Figure 2.12. Relationship between $V^4$ and $X^4$, reflected in $G_4$.

Then, what is $G_4$ that implements the mapping? To see this, let us represent
the four virtual subchannels in terms of the four independent copies of $p(y|x)$. See
Fig. 2.12.

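From this, we see that (with addition modulo 2)

$$\begin{aligned}
X_1 &= V_1 + V_2 + V_3 + V_4; \\
X_2 &= V_2 + V_4; \\
X_3 &= V_3 + V_4; \\
X_4 &= V_4.
\end{aligned}$$

Hence, we get:

$$G_4 = \begin{bmatrix} G_2 & G_2 \\ 0 & G_2 \end{bmatrix}.$$

This observation leads to the following for the case of $n = 2^k$:

$$G_{2^k} = \begin{bmatrix} G_{2^{k-1}} & G_{2^{k-1}} \\ 0 & G_{2^{k-1}} \end{bmatrix}. \tag{2.30}$$

See Fig. 2.13 for $n = 2^4$.

Figure 2.13. Relationship between $V^{16}$ and $X^{16}$, reflected in $G_{16}$.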
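The recursive construction in (2.30) is straightforward to code. The sketch below uses hypothetical helper names (`polar_matrix`, `encode`) and starts the recursion from the $1 \times 1$ matrix $[1]$, which is an assumed base case chosen so that one step reproduces $G_2$ in (2.29).

```python
def polar_matrix(k):
    """Return G_{2^k} as a list of rows, using G = [[G', G'], [0, G']] at each step."""
    G = [[1]]  # assumed base case; one recursion step gives G_2 of (2.29)
    for _ in range(k):
        m = len(G)
        top = [row + row for row in G]          # [G'  G']
        bottom = [[0] * m + row for row in G]   # [0   G']
        G = top + bottom
    return G

def encode(G, v):
    """Return x = G v with arithmetic modulo 2."""
    return [sum(g * b for g, b in zip(row, v)) % 2 for row in G]

G4 = polar_matrix(2)
for row in G4:
    print(row)
# [1, 1, 1, 1]
# [0, 1, 0, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]

# Check against X1 = V1+V2+V3+V4, X2 = V2+V4, X3 = V3+V4, X4 = V4.
print(encode(G4, [1, 0, 1, 1]))  # [1, 1, 0, 1]
```

Each matrix produced this way is upper triangular with ones on the diagonal, so it has full rank, as the encoder structure requires.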


Mathematical statement of polarization As claimed earlier, the construction
(2.30) enables the complete polarization as $n$ tends to infinity, meaning that
$I(V_i; Y^n, V^{i-1})$ is either 0 or 1 in the limit of $n$. To prove this, let us first introduce
some notation. Let $I$, $I^-$, $I^+$ be the data rates associated with $p$, $p^-$, $p^+$ respectively:

$$\begin{aligned}
I &= I(X; Y); \\
I^- &= I(V_1; Y_1, Y_2); \\
I^+ &= I(V_2; Y_1, Y_2, V_1).
\end{aligned}$$

As we verified earlier, $2I = I^- + I^+$; see (2.23). Similarly we define

$$\begin{aligned}
I^{--} &= I(V_1; Y_1, Y_2, Y_3, Y_4); \\
I^{-+} &= I(V_3; Y_1, Y_2, Y_3, Y_4, V_1); \\
I^{+-} &= I(V_2; Y_1, Y_2, Y_3, Y_4, V_1, V_3); \\
I^{++} &= I(V_4; Y_1, Y_2, Y_3, Y_4, V_1, V_3, V_2).
\end{aligned}$$

You may wonder why $I^{-+}$ (or $I^{+-}$) is not of the form $I_2 = I(V_2; Y^4, V_1)$ (or
$I_3 = I(V_3; Y^4, V^2)$). However, swapping the role of $V_2$ and $V_3$, we see that the
above is equivalent to the form of $(I_1, I_2, I_3, I_4)$. This swapping corresponds to
changing the positions of $V_2$ and $V_3$ in Fig. 2.12. As before, one can show that
$2I^- = I^{--} + I^{-+}$ and $2I^+ = I^{+-} + I^{++}$. Let $B_i$ be a polarization sign in the $i$th
layer where $i \in \{1, 2, \ldots, k\}$. Then, for general $n = 2^k$, we have $I_k := I^{B_1 B_2 \cdots B_k}$.
Similarly we can show that

$$2I_k = I_k^- + I_k^+, \quad \forall k \tag{2.31}$$

where $I_k^-$ (or $I_k^+$) indicates $I^{B_1 B_2 \cdots B_k -}$ (or $I^{B_1 B_2 \cdots B_k +}$).
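What the complete polarization means is: in the limit of $k$,

$$I_k \longrightarrow 1 \text{ or } 0. \tag{2.32}$$

This is the one that we intend to prove. In addition to this, we need to show that
the fraction of the good subchannels is $I$. The second is much easier to prove if we
make some assumption on the $B_i$'s. Suppose that the $B_i$'s are i.i.d., each being according to

$$B_i = \begin{cases} -, & \text{w.p. } \frac{1}{2}; \\ +, & \text{w.p. } \frac{1}{2}. \end{cases}$$

Then, we get:

$$I_1 = \begin{cases} I^-, & \text{w.p. } \frac{1}{2}; \\ I^+, & \text{w.p. } \frac{1}{2}, \end{cases}
\qquad
I_2 = \begin{cases} I^{--}, & \text{w.p. } \frac{1}{4}; \\ I^{-+}, & \text{w.p. } \frac{1}{4}; \\ I^{+-}, & \text{w.p. } \frac{1}{4}; \\ I^{++}, & \text{w.p. } \frac{1}{4}, \end{cases}
\qquad \cdots$$

This yields: $\forall k$,

$$\begin{aligned}
E[I_k] &= \frac{1}{2^k}\left(I^{-\cdots-} + \cdots + I^{+\cdots+}\right) \\
&\overset{(a)}{=} \frac{1}{2^k} \cdot 2^k I \\
&= I
\end{aligned}$$

where (a) follows from $\sum_{i=1}^{n} I(V_i; Y^n, V^{i-1}) = nI$. This is what we already
proved in the previous section; see (2.22) and (2.23). This then immediately implies
that the fraction of the good subchannels is $I$:

$$P(I_\infty = 1) = E[I_\infty] = I.$$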

Look ahead Inspired by a simple observation made for $n = 2$, we came up
with $G_n$ as in (2.30). We then claimed that under this $G_n$, the polarization formally
stated in (2.32) occurs. In the next section, we will prove this claim.


2.9 Polar Code: Proof of Polarization and Python Simulation


Recap In the last section, we came up with an encoding strategy that enables
the perfect polarization. We designed the full-rank matrix $G_n$, which relates $V^n$
(consisting of $nR$ information bits and $n - nR$ dummy bits) to $X^n$ (channel input)
as follows. For $n = 2^k$,

$$G_{2^k} = \begin{bmatrix} G_{2^{k-1}} & G_{2^{k-1}} \\ 0 & G_{2^{k-1}} \end{bmatrix} \tag{2.33}$$

where $G_2 = [1\ 1;\ 0\ 1]$. The key idea behind this matrix structure is to split the
original channel into bad and good subchannels repeatedly, infinitely many times. We
first split the original channel $p$ (with data rate $I := I(X; Y)$) into two subchannels:
(1) the $p^-$ channel with a smaller data rate, say $I^-$; and (2) the $p^+$ channel with a
larger data rate, say $I^+$. From $I^- < I < I^+$, we see some partial polarization, i.e., the
data rates of the two split channels move apart and are polarized to some extent.
Repeating this, we obtain four subchannels $(p^{--}, p^{-+}, p^{+-}, p^{++})$. From
$I^{--} < I^- < I < I^+ < I^{++}$, we see a larger difference between $I^{--}$ and $I^{++}$,
meaning further polarization. Repeating this $k$ times, we get $2^k$ subchannels
$(p^{-\cdots-}, \ldots, p^{+\cdots+})$ in the end. We claimed that the complete polarization
on those subchannels occurs in the limit of $k$:

$$I^{\pm\pm\cdots\pm} \text{ converges either to 1 or to 0.} \tag{2.34}$$


Outline In this section, we will bring the polar code story to a conclusion by prov- 
ing the claim (2.34) mentioned earlier. To do so, we will take the following steps. 
First, we will introduce some mathematical notations to clearly state what needs to 
be proved. We can actually prove the polarization by relying on a prominent the- 
orem in the random process literature. In the second part, we will delve into this 
theorem, which is known as the bounded martingale theorem. Then, we will use 
the theorem to prove the polarization. Lastly, we will provide numerical evidence 
of polarization through a Python simulation. 


A simpler notation for $I^{\pm\pm\cdots\pm}$ Recall the simpler notation for $I^{\pm\pm\cdots\pm}$. Let $B_i$
indicate the polarization sign in the $i$th layer where $i \in \{1, 2, \ldots, k\}$ and $n = 2^k$.
Then, such data rate can be represented as $I^{B_1 B_2 \cdots B_k}$. For notational simplicity, let
$I_k := I^{B_1 B_2 \cdots B_k}$. In terms of this notation, what we intend to prove is then:

$$I_k \longrightarrow 1 \text{ or } 0. \tag{2.35}$$


Observation The converging value in the above is either 1 or 0, meaning there
is uncertainty in the quantity. So we can view it as a random variable. This implies
that $I_k$ is a random process, and so is $B_i$.

Then, what are the statistics of the random process $\{B_i\}$? There is one key
constraint that the statistics of $\{B_i\}$ need to satisfy. Since $I_k$ indicates one of the data
rates of the $n$ split subchannels, the aggregation of all the possible values should be
the same as $nI$ (remember Observation #1 that we made in Section 2.7):

$$I^{-\cdots-} + \cdots + I^{+\cdots+} = nI,$$

which is equivalent to:

$$\frac{1}{n}\left(I^{-\cdots-} + \cdots + I^{+\cdots+}\right) = I. \tag{2.36}$$

Suppose the $B_i$'s are i.i.d. $\sim \text{Bern}(\frac{1}{2})$. Then, the above (2.36) implies that

$$E[I_k] = I. \tag{2.37}$$

If we can prove the perfect polarization (2.35) in the end, then this together
with (2.37) yields:

$$P(I_\infty = 1) = I. \tag{2.38}$$

Here (2.38) means the fraction of the perfect channels is $I$, which is what we need
to satisfy. Hence this leads us to assume that the $B_i$'s are i.i.d., each being according to:

$$B_i = \begin{cases} +, & \text{w.p. } \frac{1}{2}; \\ -, & \text{w.p. } \frac{1}{2}. \end{cases}$$

Convergence of the random process $I_k$ The proof of (2.35) requires the
proof of the convergence of $I_k$. It turns out that $I_k$ belongs to a special class of
random processes and there is a key theorem for the special class that serves to
prove the convergence.

To figure out what the special class is, let us make some observations. One
observation is:

$$0 \le I_k \le 1.$$

This is obvious as we consider a binary-input channel where the capacity cannot
exceed 1. This means that the random process $I_k$ is bounded. Another observation
can be made on the following quantity: $E[I_{k+1} | B_1, B_2, \ldots, B_k]$. Viewing $k$ as the
current time index, one can interpret this quantity as the expected future outcome
given the current & past knowledge. Observe that given $(B_1, \ldots, B_k)$, $I_{k+1}$ is a sole
function of $B_{k+1}$:

$$I_{k+1} = \begin{cases}
I_k^-, & \text{w.p. } P(B_{k+1} = -) = \frac{1}{2}; \\
I_k^+, & \text{w.p. } P(B_{k+1} = +) = \frac{1}{2}
\end{cases} \tag{2.39}$$

where $I_k^-$ (or $I_k^+$) indicates $I^{B_1 B_2 \cdots B_k -}$ (or $I^{B_1 B_2 \cdots B_k +}$).
This then yields:

$$\begin{aligned}
E[I_{k+1} | B_1, B_2, \ldots, B_k] &= P(B_{k+1} = -) I_k^- + P(B_{k+1} = +) I_k^+ \\
&= \frac{1}{2}(I_k^- + I_k^+) \\
&= I_k,
\end{aligned} \tag{2.40}$$

where the last equality follows from the fact that $2I_k = I_k^- + I_k^+$ (why?). Notice
that the expected future outcome, reflected in $E[I_{k+1} | B_1, \ldots, B_k]$, is the same as
the current outcome $I_k$. This is the key property that characterizes one of the
well-known random processes, called the martingale. We say that a random process is a
martingale if (2.40) holds.⁴


4. The term “martingale” has an interesting historical origin in the gambling realm where it was used to describe
a particular betting tactic. In the context of probability theory, however, it refers to a stochastic process
that models a fair game of chance. For example, suppose we have a game where $B_i$ represents the amount
of money gained or lost in the $i$th round of play. In this scenario, the values of $B_i$ are independent and
identically distributed, taking either $-1$ or $1$ with equal probability, ensuring that the game is unbiased.
Consider $I_k = \sum_{i=1}^{k} B_i$ which denotes the capital after $k$ games. Note that $\{I_k\}$ is a martingale as it satisfies
the property (2.40): $E[I_{k+1} | B_1, \ldots, B_k] = I_k$. We see that this property is a consequence of the fair game.


Bounded martingale theorem There is a well-known result on the convergence
of such a bounded martingale: for a bounded martingale $I_k$,

$$I_k \longrightarrow I_\infty \quad \text{almost surely} \tag{2.41}$$

where $I_\infty$ indicates a random variable that represents the limit of $I_k$. You may wonder
what “almost surely” means. Remember in Part I that we learned about one
type of convergence w.r.t. a random process, namely the convergence in probability.
There are a couple more types of convergence regarding a random process. One
such type is the convergence almost surely. Mathematically, it means:

$$P\left(\lim_{k \to \infty} I_k = I_\infty\right) = 1. \tag{2.42}$$

What it means is that the limit of $I_k$ is almost surely $I_\infty$, as the name of the
convergence suggests. As the expression of (2.42) indicates, another name of it is the
convergence w.p. 1. Actually this is a stronger type of convergence relative to the
one in probability because:

$$(2.42) \Longrightarrow \lim_{k \to \infty} P\left(|I_k - I_\infty| \le \epsilon\right) = 1 \quad \text{for any } \epsilon > 0.$$

It turns out that the other way around does not necessarily hold, i.e., there are
examples in which only the convergence in probability holds (think about such
examples). Hence, the convergence almost surely is of a stronger type.

In fact, the proof of the bounded martingale theorem (2.41) is not that simple. 
This book does not cover the proof of the theorem as it requires a background in 
probability theory, which is not covered here. If you are interested in the proof, you 
may refer to a graduate-level textbook on probability theory, such as (Grimmett 
and Stirzaker, 2020). 
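The almost-sure convergence can be made tangible by following individual sample paths of $I_k$. The sketch below is an illustration for the BEC, where one step of the process is $I_k^- = I_k^2$ or $I_k^+ = 1 - (1 - I_k)^2$, each with probability $\frac{1}{2}$. The starting rate 0.6, the number of steps, the thresholds, and the helper name `sample_path` are all assumptions made for this example.

```python
import random

def sample_path(I0, steps, rng):
    """Follow one realization of I_k for a BEC: each step picks - or + with prob. 1/2."""
    I = I0
    for _ in range(steps):
        if rng.random() < 0.5:
            I = I * I                # B = -: bad subchannel
        else:
            I = 1 - (1 - I) ** 2     # B = +: good subchannel
    return I

rng = random.Random(1)   # fixed seed, for reproducibility only
I0 = 0.6                 # assumed initial data rate (BEC with erasure prob. 0.4)
finals = [sample_path(I0, 60, rng) for _ in range(20_000)]

# Fraction of paths that ended very close to 0 or 1, and fraction that ended near 1.
polarized = sum(min(x, 1 - x) < 1e-3 for x in finals) / len(finals)
near_one = sum(x > 0.5 for x in finals) / len(finals)
print(polarized, near_one)
```

Nearly every path should end up next to 0 or 1, and the fraction ending near 1 should be close to the initial rate $I = 0.6$, in line with (2.38).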


Proof of polarization: $I_\infty = 1$ or $0$ The key property on $\{I_k\}$ reflected in (2.41)
means that $I_k$ actually converges to $I_\infty$. For the proof of polarization, it suffices to
show that

$$I_\infty = 1 \text{ or } 0.$$

The proof of this requires the following two: (i) the bounded martingale
theorem; and (ii) the recursive relationship between $I_k$ and $I_{k+1}$, reflected in (2.39).
Using (2.39), we get:

$$\begin{aligned}
E[|I_{k+1} - I_k| \,|\, B_1, \ldots, B_k] &\overset{(a)}{=} \frac{1}{2}(I_k - I_k^-) + \frac{1}{2}(I_k^+ - I_k) \\
&= \frac{1}{2}(I_k^+ - I_k^-)
\end{aligned} \tag{2.43}$$

where (a) is due to (2.39). On the other hand, the bounded martingale
theorem (2.41) yields:

$$I_k \longrightarrow I_\infty \text{ almost surely} \quad \& \quad I_{k+1} \longrightarrow I_\infty \text{ almost surely}.$$

This implies that:

$$|I_{k+1} - I_k| \longrightarrow 0 \text{ almost surely}.$$

Since $|I_{k+1} - I_k|$ is bounded, this then yields:

$$E[|I_{k+1} - I_k|] \longrightarrow 0. \tag{2.44}$$

Now using the tower property,⁵ we get:

$$E[|I_{k+1} - I_k|] = E_{B_1, \ldots, B_k}\left[E_{B_{k+1}}\left[|I_{k+1} - I_k| \,|\, B_1, \ldots, B_k\right]\right].$$

For a non-negative random process $|I_{k+1} - I_k|$, if its mean converges to 0, such
random process also converges to 0. Hence, the above tower property together
with (2.44) yields:

$$E_{B_{k+1}}\left[|I_{k+1} - I_k| \,|\, B_1, \ldots, B_k\right] \longrightarrow 0 \text{ almost surely}. \tag{2.45}$$

(In fact, because $2I_k = I_k^- + I_k^+$, both possible values of $|I_{k+1} - I_k|$ equal
$\frac{1}{2}(I_k^+ - I_k^-)$, so (2.45) also follows directly from the almost sure convergence of
$|I_{k+1} - I_k|$.) Applying (2.43) to the above (2.45), we get:

$$I_k^+ - I_k^- \longrightarrow 0 \text{ almost surely}. \tag{2.46}$$

Note that $I_k$ is a random process which takes the data rate of one of the virtual
subchannels. Such virtual subchannels are of the same type as the original channel: BEC
(remember what we learned in Section 2.8). So one can think of an erasure probability
for $I_k$, say $\alpha_k$. Then, $I_k = 1 - \alpha_k$. Applying the channel-splitting idea that we came
up with earlier, the bad channel split from the $\alpha_k$-channel would have
$\alpha_k^- = 1 - (1 - \alpha_k)^2$; on the other hand, the good channel would have $\alpha_k^+ = \alpha_k^2$.
See Fig. 2.14.

Figure 2.14. Channel-splitting from $I_k$ to $(I_k^-, I_k^+)$.

This then gives:

$$\begin{aligned}
I_k^- &= (1 - \alpha_k)^2; \\
I_k^+ &= 1 - \alpha_k^2.
\end{aligned} \tag{2.47}$$

This together with (2.46) yields:

$$1 - \alpha_k^2 - (1 - \alpha_k)^2 \longrightarrow 0. \tag{2.48}$$

Since $1 - \alpha_k^2 - (1 - \alpha_k)^2 = 2\alpha_k(1 - \alpha_k)$, this is equivalent to:

$$\alpha_k \longrightarrow 0 \text{ or } 1. \tag{2.49}$$

Hence, we complete the proof:

$$I_k \longrightarrow 1 \text{ or } 0. \tag{2.50}$$


5. The tower property is a useful and well-known property that arises in the random process context. It says:
for random variables (or vectors), say $X$ and $Y$, $E[X] = E_Y[E_X[X|Y]]$. One can readily prove this using
the definition of conditional probability. Try Prob 6.1.


Extension Thus far, we have outlined the narrative of the polar code, including 
the encoder structure that facilitates polarization and the proof of polarization using 
the bounded martingale theorem. We have primarily focused on the binary erasure 
channel (BEC) for the sake of simplicity, but it should be noted that the polar code 
has been shown to achieve capacity for all binary-input channels. The expansion to 
the general case is explored in Prob 6.5 and Prob 6.7. If you wish to demonstrate 
polarization for binary symmetric channels, follow the guidelines in Prob 6.5. If 
you prefer to show polarization for binary-input memoryless channels, refer to the 
instructions in Prob 6.7. 


Python simulation of polarization We will use a Python simulation to confirm
the polarization of $I_k$. We will simulate the same scenario we have been focusing
on: $n = 2^k$, $I_k := I^{b_1 b_2 \cdots b_k}$, and the channel is the BEC with an erasure
probability $\alpha$. Since $I_{k+1}$ takes either $I_k^-$ or $I_k^+$, which are generated recursively
from $I_k$, we will use a binary tree class with top and bottom children.
class TreeNode:
    def __init__(self, val, top=None, bottom=None):
        self.val = val
        self.top = top
        self.bottom = bottom


We wish to construct a binary code tree as illustrated in Fig. 2.14. We assign $I_k$
to node.val and associate the top and bottom trees with $I_k^-$ and $I_k^+$, respectively.
The relationship between $I_k$ and $I_k^-$ (or $I_k^+$) reads:

$I_k^- = I_k^2;$
$I_k^+ = 1 - (1 - I_k)^2.$

We iterate this procedure until we reach all the leaves. See below for code implementation.


k=20
n=2**k
res=[] # an array for containing all the I_k's
# k=2 --> "res" is of the following structure
# res=[[x],[x,x],[x,x,x,x]]
alpha = 0.4 # erasure probability of BEC
root=TreeNode(1-alpha) # I=1-alpha (root node)

# Recursive function for tree generation
def rec_treeGen(node,depth):
    # initialization of the resulting array
    if len(res)<=depth: res.append([])
    # Append node.val in the depth level
    res[depth].append(node.val)
    # If reaching a leaf node, terminate the process
    if depth==k: return
    else: # for an internal node, iterate further
        # Construct the top node and go deeper
        node.top=TreeNode(node.val**2)
        rec_treeGen(node.top,depth+1)
        # Construct the bottom node and go deeper
        node.bottom=TreeNode(1-(1-node.val)**2)
        rec_treeGen(node.bottom,depth+1)

rec_treeGen(root,0)
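The same values can also be generated level by level with NumPy, which avoids building roughly 2^(k+1) tree nodes. The following is a minimal sketch under our own naming (`polarize_levels` is hypothetical, not from the book); the order of values within a level differs from the tree traversal, but the multiset at each depth is the same. It also checks the martingale property numerically: the average of the $I_k$ values at every depth stays at 1 − α.

```python
import numpy as np

def polarize_levels(alpha, k):
    """Return a list whose entry d holds the 2**d values that I_d can take."""
    levels = [np.array([1.0 - alpha])]
    for _ in range(k):
        cur = levels[-1]
        minus = cur ** 2                 # I^- = I^2
        plus = 1.0 - (1.0 - cur) ** 2    # I^+ = 1 - (1 - I)^2
        levels.append(np.concatenate([minus, plus]))
    return levels

levels = polarize_levels(alpha=0.4, k=10)
means = [lvl.mean() for lvl in levels]   # each should be close to 1 - alpha
print(means[0], means[-1])
```

A smaller depth (k = 10) is used here only to keep the example fast; the function accepts k = 20 as in the text.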


Using the resulting array res, we can then plot an empirical CDF of $I_k$ as in
Fig. 2.6:

$\frac{|\{i : \text{res}[k][i] \le x\}|}{n}.$


import numpy as np

# range of x
x_grid=np.arange(0,1,0.0001)
# initialization of cdf
cdf_1=[0]*len(x_grid)
cdf_2=[0]*len(x_grid)
cdf_2_4=[0]*len(x_grid)
cdf_2_20=[0]*len(x_grid)

# Sorting I_k
# Why? To ease the computation of empirical cdf
sres_1 = sorted(res[0])
sres_2 = sorted(res[1])
sres_2_4 = sorted(res[4])
sres_2_20 = sorted(res[20])

for i,x in enumerate(x_grid):
    # Case n=1
    for j in range(len(sres_1)):
        if sres_1[j]>x:
            cdf_1[i]=j # because sres_1 is sorted
            break
        # if sres_1[j]<=x for all j, set cdf(x)=1
        if j==len(sres_1)-1: cdf_1[i]=1
    # Case n=2
    for j in range(len(sres_2)):
        if sres_2[j]>x:
            cdf_2[i]=j/2 # divided by n=2
            break
        if j==len(sres_2)-1: cdf_2[i]=1
    # Case n=2^4
    for j in range(len(sres_2_4)):
        if sres_2_4[j]>x:
            cdf_2_4[i]=j/(2**4) # divided by n=2^4
            break
        if j==len(sres_2_4)-1: cdf_2_4[i]=1
    # Case n=2^20
    for j in range(len(sres_2_20)):
        if sres_2_20[j]>x:
            cdf_2_20[i]=j/(2**20) # divided by n=2^20
            break
        if j==len(sres_2_20)-1: cdf_2_20[i]=1

# Set cdf_2_20[-1]=1 since x_grid is not dense enough
cdf_2_20[-1]=1
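The nested loops above scan the sorted list once per grid point, which can be slow for n = 2^20. As an alternative sketch (not the book's code), the same empirical CDF can be obtained with `np.searchsorted`, which counts the entries that are at most x via binary search. The helper name `empirical_cdf` and the toy input are our own choices for illustration.

```python
import numpy as np

def empirical_cdf(values, x_grid):
    """Fraction of entries in `values` that are <= x, for each x in x_grid."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    # side="right" returns the number of entries <= x
    counts = np.searchsorted(sorted_vals, x_grid, side="right")
    return counts / len(sorted_vals)

x_grid = np.arange(0, 1, 0.0001)
toy = [0.36, 0.84]                      # the two values of I_1 when alpha = 0.4
cdf_toy = empirical_cdf(toy, x_grid)
```

With the array from the text, one would call, for instance, `empirical_cdf(res[20], x_grid)`.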


Here is a code for plotting the empirical CDFs for four cases: $n = 1, 2, 2^4, 2^{20}$.


import matplotlib.pyplot as plt
import matplotlib

plt.figure(figsize=(10,2),dpi=500)
# Adjust the font size of axis tick values
matplotlib.rc('xtick', labelsize=7)
matplotlib.rc('ytick', labelsize=7)

plt.subplot(1,4,1)
plt.plot(x_grid,cdf_1,color='red')
plt.xlabel('x', fontsize=10)
plt.ylabel('empirical CDF', fontsize=8)
plt.title('$n=1$', fontsize=10)
plt.subplot(1,4,2)
plt.plot(x_grid,cdf_2,color='red')
plt.xlabel('x', fontsize=10)
plt.title('$n=2$', fontsize=10)
plt.subplot(1,4,3)
plt.plot(x_grid,cdf_2_4,color='red')
plt.xlabel('x', fontsize=10)
plt.title('$n=2^4$', fontsize=10)
plt.subplot(1,4,4)
plt.plot(x_grid,cdf_2_20,color='red')
plt.xlabel('x', fontsize=10)
plt.title('$n=2^{20}$', fontsize=10)
plt.show()


Note that the obtained curves follow the same trend as those in Fig. 2.6.

Figure 2.15. Python simulation for verifying polarization. As n increases, the fraction of
perfect subchannels converges to $I = 1 - \alpha$ (= 0.6 in this simulation).


Look ahead In Parts I and II, we delved into the source and channel coding theo- 
rems, respectively, and also explored important information-theoretic measures like 
entropy, mutual information, and KL divergence. Additionally, we placed signifi- 
cant emphasis on the concept of phase transition, which resembles a fundamental 
law in physics. These information-theoretic tools and the phase transition concept 
are instrumental in addressing critical issues that emerge in various fields beyond 
communication. Part III aims to showcase their role in data science applications,
including social networks, ranking, computational biology, machine learning,
and deep learning. The following section will commence our investigation into
their role within the context of social networks.
Problem Set 6


Prob 6.1 (Basics) 


(a) Let $X_1$ and $X_2$ be Bern($\frac{1}{2}$) random variables which are independent of
each other. Let $Z = X_1 \oplus X_2$. Show that $X_1$ and $Z$ are independent, i.e.,
$I(X_1; Z) = 0$.

(b) A curious student claims that $X_1$ and $Z$ (in part (a)) are independent even
if $X_2$ is given, i.e., $I(X_1; Z|X_2) = 0$. Prove or disprove it.

(c) Let X and Y be random variables (or random vectors). Prove the tower 
property: 


$E[X] = E_Y[E_X[X|Y]].$


Prob 6.2 (Chain rule for mutual information) Consider two random pro- 
cesses: {X;} and {Y;}. 


(a) Show that 


IY) = > 1G Yi ye) 
j=l 


where X’ := (X},...,X;) and Y’ := (Y%,..., Yj). 
(b) Let $f(i,j)$ be a function of $i$ and $j$ where $i, j \in \mathbb{N}$. Show that

$\sum_{i=1}^{n} \sum_{j=1}^{i} f(i,j) = \sum_{j=1}^{n} \sum_{i=j}^{n} f(i,j).$

(c) Using parts (a) and (b), show that
DS VAY = DTG YI, YI?) 
i=1 j=l 


where y” = (Y, Yai... +> Yn). 


Prob 6.3 (Useful facts in the polar code) Let $\{V_i\}$ be an i.i.d. random process
$\sim$ Bern($\frac{1}{2}$). Let $G_n = [G_{ij}]$ be a full rank matrix of size n-by-n where each entry
$G_{ij}$ is a binary value. Let $X^n = G_n V^n$ where $V^n := [V_1, V_2, \ldots, V_n]^T$ denotes an
n-by-1 column vector.


(a) Consider a case in which $n = 2$ and
Show that $X_1$ and $X_2$ are i.i.d.

(b) Consider $X_i = \sum_{\ell=1}^{n} G_{i\ell} V_\ell$ and $X_j = \sum_{\ell=1}^{n} G_{j\ell} V_\ell$ where $\sum$ indicates the
modulo-2 addition and $i \ne j$. Argue that there exists some component $V_\ell$
which appears in $X_i$ but not in $X_j$ (or vice versa). Show that $I(X_i; X_j) = 0$,
i.e., $X_i$ and $X_j$ are independent. Also show that $X_i$'s are i.i.d.

Hint: The full rank condition on $G_n$ implies that its row vectors are linearly
independent.

(c) Let $Y^n$ be the output of a BEC when $X^n$ is fed into it. Show that

$I(V^n; Y^n) = I(X^n; Y^n)$ and $I(X^n; Y^n) = nI$

where $I := I(X; Y)$ and $X$ (or $Y$) denotes a generic random variable for $X_i$
(or $Y_i$).
(d) Show that 


AVS) WYER). 
i=1 


Prob 6.4 (Polarization in BEC) Suppose that $V_1$ and $V_2$ are i.i.d. $\sim$ Bern($\frac{1}{2}$),
and $(X_1, X_2)$ are constructed through $G_2 = [1, 1; 0, 1]$:

$\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}.$

We pass $(X_1, X_2)$ through two independent copies of a BEC, say $p$, with erasure
probability $\alpha$, thus yielding the channel output $(Y_1, Y_2)$. As we learned in
Section 2.7, we convert the collection of the two independent copies into two virtual
subchannels:

$p^-: V_1 \to (Y_1, Y_2);$ (2.51)
$p^+: V_2 \to (Y_1, Y_2, V_1).$ (2.52)

Let $(I, I^-, I^+)$ be the data rates associated with $(p, p^-, p^+)$ respectively:

$2I = I(X_1, X_2; Y_1, Y_2);$
$I^- = I(V_1; Y_1, Y_2);$
$I^+ = I(V_2; Y_1, Y_2, V_1).$


(a) Compute $(I, I^-, I^+)$. Express them in terms of $\alpha$.
(b) Can $p^-$ be interpreted as another BEC? If so, explain why and indicate the
corresponding erasure probability.
(c) Repeat part (b) for $p^+$.
(d) Compare $I^-$ and $I^+$. Which one is larger?


Prob 6.5 (Polarization in BSC) Consider the same problem setup as that in
Prob 6.4. The only distinction is that the channel is a BSC with crossover probability
$\alpha$.
(a) Compute $H(V_1|Y_1, Y_2)$ and $I(V_1; Y_1, Y_2)$.
(b) Compute $H(V_1|Y_1 \oplus Y_2)$ and $I(V_1; Y_1 \oplus Y_2)$.
(c) Using the above, show that $I(V_1; Y_1, Y_2|Y_1 \oplus Y_2) = 0$, i.e., $V_1 - (Y_1 \oplus Y_2) - (Y_1, Y_2)$.
(d) Can $p^-$ be interpreted as another BSC? If so, explain why and indicate the
corresponding crossover probability.
(e) Show that $2I = I^- + I^+$. Using this and part (d), compute $I^+$.
(f) Compare $I^-$ and $I^+$. Which one is larger, or are they the same?
(g) Show that when $I^+ - I^- = 0$, $H(\alpha)$ is either 0 or 1.
(h) The result of part (g) proves the perfect polarization for BSC. Explain why.


Prob 6.6 (Python simulation for BSC polarization) Consider a setting
where $n = 2^k$, $I_k := I^{b_1 b_2 \cdots b_k}$, and the channel is a BSC with crossover probability
$\alpha$. Let $I$ be the capacity of the BSC: $I = 1 - H(\alpha)$.


(a) Using the result in Prob 6.5, express $I^-$ and $I^+$ in terms of $\alpha$.

(b) Using part (a) and the skeleton Python code in Section 2.9, construct a
code yielding all the possible values (say res[k][i]) that $I_k$ can take on. Here
res[k] is a $2^k$-sized array that contains all the values for $I_k$.

(c) Set $\alpha = 0.3$. Using part (b) and the skeleton code in Section 2.9, plot
empirical CDFs for four cases: $n = 1, 2, 2^4, 2^{20}$:

$\frac{|\{i : \text{res}[k][i] \le x\}|}{n}.$


Prob 6.7 (Polarization in B-DMC) Consider the same problem setup as that in
Prob 6.4 and Prob 6.5. The only distinction is that the channel is a B-DMC where
the channel input is binary-valued. Let $\mathcal{Y}$ be the range of the channel output.
(a) Argue that given $Y_1 = y_1$, $X_1 \sim$ Bern($q_1$) where $q_1$ is a function of $y_1$.
Using this, show that

$I := I(X_1; Y_1) = 1 - \sum_{y_1 \in \mathcal{Y}} p(y_1) H(q_1).$

(b) Argue that given $(Y_1, Y_2) = (y_1, y_2)$, channel inputs $(X_1, X_2)$ are two independent
binary random variables, say $X_i \sim$ Bern($q_i$) where $q_i$ is a sole
function of $y_i$, $i \in \{1, 2\}$. Using this, show that

$I^- = 1 - \sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} p(y_1) p(y_2) H(q_1(1 - q_2) + (1 - q_1) q_2).$

(c) Using $2I = I^- + I^+$, show that

$I^+ = 1 + \sum_{y_1 \in \mathcal{Y}} \sum_{y_2 \in \mathcal{Y}} p(y_1) p(y_2) H(q_1(1 - q_2) + (1 - q_1) q_2) - 2 \sum_{y_1 \in \mathcal{Y}} p(y_1) H(q_1).$

(d) Show that when $I^+ - I^- = 0$, $I$ is either 0 or 1.
(e) The result of part (d) proves the perfect polarization for B-DMC. Explain
why.


Prob 6.8 (Python simulation for B-DMC polarization) Consider a setting
where $n = 2^k$, $I_k := I^{b_1 b_2 \cdots b_k}$, and the channel is a B-DMC. Let $I$, $I^-$, $I^+$ be the
quantities derived in Prob 6.7. Assume that $\mathcal{Y} = \{0, 1, 2\}$.


(a) Using the skeleton Python code in Section 2.9, construct a code yielding
all the possible values (say res[k][i]) that $I_k$ can take on. Here res[k] is a
$2^k$-sized array that contains all the values for $I_k$.

(b) Set qi = 4 for i € {1,2}. Using part (a) and the skeleton code in 
Section 2.9, plot empirical CDFs for four cases: $n = 1, 2, 2^4, 2^{20}$:

$\frac{|\{i : \text{res}[k][i] \le x\}|}{n}.$


Prob 6.9 (True or False?) 


(a) Let $X$, $Y$, $Z$ be Bernoulli random variables. Suppose $I(X; Y|Z) = 0$. Then,
$I(X; Y) = 0$.
(b) Let $(X_1, X_2, Y_1, Y_2)$ be discrete random variables. Suppose


I(X%3X2| Vi) = 0; 
IX; %2) = 0. 
A curious student claims that 
IX; X211, Yo) = 0. 


Either prove or disprove the statement. 
Chapter 3


Data Science Applications 


3.1 Social Networks: Fundamental Limits 


Concepts and tools learned from Parts I and II In Parts I and II, we investigated
the well-known theorems in information theory: the source coding theorem,
channel coding theorem, and source-channel separation theorem. In Part I,
we introduced important concepts in information theory such as entropy, mutual 
information, and KL divergence. We also examined the idea of prefix-free codes, 
typical sequences, and some lower and upper bounding techniques. With these 
tools, we proved the source coding theorem and analyzed a specific code, the 
Huffman code, along with its implementation in Python. 

In Part II, we explored various concepts and techniques related to channel codes. 
One crucial idea we discussed was the phase transition, where there is a critical data 
rate below which we can make the error probability arbitrarily close to zero and 
above which no matter what we do, the error probability is not zero. We used tech- 
niques like random coding, maximum likelihood decoding, joint typicality decod- 
ing, and union bound for the achievability proof, and Fano’s inequality and data 
processing inequality for the converse proof. We also proved the source-channel 
separation theorem and demonstrated that feedback cannot increase capacity in 
a DMC. Moreover, we examined the polar code, an explicit channel code that 
achieves the capacity of a specific type of memoryless channel, binary-input DMCs. 
The concepts and information-theoretic notions we have explored are applica-
ble in various domains beyond communication. In particular, these concepts are 
instrumental in data science. One example is the occurrence of phase transition in 
various inference problems within data science. Examples include: (i) community 
recovery of social networks (Girvan and Newman, 2002; Fortunato, 2010; Abbe, 
2017; Chen et al., 2016b); (ii) DNA sequencing in computational biology (Brown- 
ing and Browning, 2011; Das and Vikalo, 2015; Chen et al., 2016a; Si et al., 2014); 
(iii) ranking in search engine (Negahban et al., 2012; Chen and Suh, 2015); and 
(iv) matrix completion in recommender systems (Candès and Tao, 2010; Keshavan
et al., 2010; Candès and Recht, 2012; Ahn et al., 2018; Elmahdy et al., 2020; Zhang
et al., 2021). The KL divergence is utilized in the development of a loss function for
optimizing supervised learning, which is one of the most significant frameworks in 
machine learning. Mutual information is used to enhance powerful unsupervised 
learning frameworks such as generative adversarial networks (GANs) (Goodfellow 
et al., 2014). Recently, mutual information has also been applied in the design of 
new machine learning models known as fair prediction models. These models not 
only ensure accurate predictions but also guarantee fairness in prediction statistics 
for different groups and individuals (Larson et al., 2016; Zafar et al., 2017; Cho 
et al., 2020; Roh et al., 2020). 


Goal of Part Ill In Part III, our aim is to illustrate the significance of the phase 
transition concept and fundamental notions through the exploration of three infer- 
ence problems and three machine learning models: 


(1) Inference problem #1: Community detection; 

(2) Inference problem #2: DNA sequencing; 

(3) Inference problem #3: Ranking; 

(4) Supervised learning: Design of a loss function; 

(5) Unsupervised learning: Generative Adversarial Networks (GANs);
(6) Fair machine learning: Design of fair classifiers. 


Outline This section will examine the first inference problem: Community detec- 
tion. It consists of four parts. Initially, we will study what community detection is 
and discuss why it is essential in data science. Next, we will establish a related mathe- 
matical problem and address a crucial issue regarding phase transition. We will then 
establish a connection to the communication problem we have previously studied. 
Finally, we will investigate how phase transition manifests in the problem context. 
[Figure 3.1 labels: "node" indicates "user"; figure out which user belongs to which community.]


Figure 3.1. Community detection in picture. 


Community detection (Girvan and Newman, 2002; Fortunato, 2010; 
Abbe, 2017) Community refers to a collection of individuals with shared inter- 
ests or residing in the same locality. The objective of community detection is to 
identify similar groups, as illustrated in Fig. 3.1. Suppose users indicate nodes in a 
graph (shown on the left in Fig. 3.1), and there are two communities: blue and red 
communities. The aim of the problem is to determine which user (node) belongs 
to which community among the blue and red communities (shown on the right 
in Fig. 3.1). Clustering is another term for this problem (Bansal et al., 2004; Jalali 
et al., 2011). Note that the output nodes are clustered into either a blue or a red 
community. 

One may wonder why we should be concerned with this problem. The prob- 
lem arises in several critical domains, such as social networks (Facebook, LinkedIn, 
Twitter, etc.), and biological networks (Chen and Yuan, 2006). In social networks, 
identifying community memberships can assist in identifying target groups for 
product advertisements. In biological networks, the problem is relevant to DNA 
sequencing for cancer detection and personalized medicine. In Section 3.5, we will 
discuss how this problem is connected to DNA sequencing. The problem has appli- 
cations beyond these examples. 


Problem formulation First consider the type of information that can be accessed 
in many applications. One common type of information that is easily accessible is 
relationship information, such as friendship in Meta’s social network, connections 
in LinkedIn, and followers in Twitter. However, community memberships are often 
not revealed in practice. For example, in Meta’s social network, only the friendship 
information is available, and it is unknown whether a user belongs to a particular 
community. In privacy-concerned contexts, this information is prohibited from 
being made public by law, even if Meta has access to it. 
Figure 3.2. An example of community detection.


To give you a concrete feel as to what the relationship information looks like, let
us give you an example. Suppose that $x_i$ indicates the community membership of user $i$,
and we assign $x_i$ to node $i$. Then, one natural function is a parity function between
the values assigned to two nodes, e.g., $x_i$ and $x_j$. For instance, when $x_1 = x_2$, we
get, say, $y_{12} = 0$; otherwise $y_{12} = 1$.

Given a collection of $y_{ij}$'s, the goal of the problem is then to decode $x :=
[x_1, x_2, \ldots, x_n]$. Upon reflection, this objective is unattainable: $x$ cannot be
decoded exactly based solely on the parities $y_{ij}$'s. To illustrate this point,
consider the scenario depicted in Fig. 3.2. Suppose that $n = 4$ and $(y_{12}, y_{13}, y_{14})$
are given as $(0, 1, 1)$. If $x_1 = 0$, then $x = [0, 0, 1, 1]$. However, an ambiguity arises
on the value of $x_1$ because there is no way to infer $x_1$ only from the parities. The
other solution $x = [1, 1, 0, 0]$ is also valid. We always have two valid solutions:
(i) the correct one; and (ii) its flipped counterpart. To resolve this ambiguity, we
should relax the goal as follows: decoding $x$ or $x \oplus 1$ from $y_{ij}$'s. Here $\oplus$ indicates
the bit-wise modulo-2 sum.
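The ambiguity can be reproduced in a few lines. The sketch below is illustrative only: the helper names `parities` and `decode_from_first_row` are ours, and users are indexed from 0 in the code, so the pair (1, 2) of the text becomes (0, 1). It computes the parities of a membership vector and shows that guessing either value for the first user yields a vector consistent with the same parities.

```python
from itertools import combinations

def parities(x):
    """Map a membership vector x to the parities y_ij = x_i XOR x_j (i < j)."""
    return {(i, j): x[i] ^ x[j] for i, j in combinations(range(len(x)), 2)}

def decode_from_first_row(y, n, x1):
    """Recover x from y_1j, j = 2..n, after guessing the value x1 of the first user."""
    return [x1] + [x1 ^ y[(0, j)] for j in range(1, n)]

x = [0, 0, 1, 1]
y = parities(x)                                   # (y_12, y_13, y_14) = (0, 1, 1)
candidate_a = decode_from_first_row(y, 4, x1=0)   # the vector itself
candidate_b = decode_from_first_row(y, 4, x1=1)   # its flipped counterpart
print(candidate_a, candidate_b)
```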


Two challenges Given the relaxed goal, solving the problem may appear to be 
not too difficult, as suggested by the example in Fig. 3.2. However, in the era of big 
data, two challenges arise. Firstly, in many applications, such as social networks, the 
number of nodes (i.e., users) can be very large. For example, as of December 2022, 
the number of Facebook users has reached 2.96 billion (Meta, 2022). Therefore, 
we may only have access to a portion of the parities. Note that the number of all
possible pairs is huge: $\binom{n}{2} \approx 4.38 \times 10^{18}$ in the above example.

Figure 3.3. The number of Facebook users is ~2.96 billion as of December 2022.

In situations where we are allowed to choose any pair of two nodes, community
detection is straightforward. For instance, by selecting a set of consecutive pairs like
$(y_{12}, y_{23}, y_{34}, \ldots, y_{(n-1)n})$, we can easily decode $x$ up to a global shift. However, this
approach is often infeasible because of the second challenge. In many applications,
such as Meta's social network, similarity relationships are passively given according
to a context. This means that such information is not obtained by our own choice.
For instance, the friendship information in Meta's social network is simply given by
the context. It is not possible to ask an arbitrary pair of two users whether they are
friends or not. Therefore, one natural assumption is that the similarity information
is given in a probabilistic manner. For example, the parity of a pair of any two users
is given with probability $p$, independently across all the other pairs.


An information-theoretic question Although community detection is a chal- 
lenging task due to the limited and probabilistic nature of pairwise measurements, 
there is good news. The good news is that, as demonstrated in the example in 
Fig. 3.2, it may not be necessary to observe every measurement pair in order to 
decode x. Since the pairs are highly dependent on each other, partial pairs might be 
enough to achieve this goal. An information-theoretic question arises: is there a fun- 
damental limit on the number of measurement pairs required to enable detection? 
Interestingly, similar to the channel capacity in communication, a phase transition 
occurs. There exists a sharp threshold on the number of pairwise measurements 
above which reliable community detection is possible and below which it is impossible,
regardless of the method used. For the rest of this section, we will investigate
this threshold in detail.


Translation to a communication problem Under the partial and random
observation setting, $y_{ij}$ is statistically related to $x_i$ and $x_j$. Hence, community detection
is an inference problem, suggesting an intimate connection with a communication
problem. Translating community detection into a communication problem,
we can come up with a mathematical statement on the limit.
Figure 3.4. Translation to a communication problem.


Let us first translate the problem as below. See Fig. 3.4. One can view $x$ as a
message that we wish to send and $\hat{x}$ as a decoded message. What we are given are
pairwise measurements. A block diagram at the transmitter converts $x$ into pairwise
measurements, say $x_{ij}$'s. Here $x_{ij} := x_i \oplus x_j$. One can view this as an encoder. Since
we assume that only part of the pairwise measurements are observed at random,
we have another block which implements the partial & random measurements to
extract a subset of $x_{ij}$'s. One can view this processing as the one that behaves like a
channel where the output $y_{ij}$ admits:

$y_{ij} = \begin{cases} x_{ij} & \text{w.p. } p; \\ e & \text{w.p. } 1 - p \end{cases}$

where $p$ denotes the observation probability and $e$ indicates empty information
(erasure). In other words, the measurement process can be modeled as an erasure
channel with erasure probability $1 - p$. These $y_{ij}$'s are then fed into an
algorithm block, thus yielding $\hat{x}$. The algorithm block can be interpreted as a
decoder.
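To make the channel model concrete, here is a minimal simulation sketch. The function name `observe`, the seed, and the values n = 200 and p = 0.1 are our own illustrative assumptions. Each parity is kept with probability p and replaced by the erasure symbol "e" otherwise, independently across pairs; the number of observed pairs then lands near $\binom{n}{2} p$.

```python
import random
from itertools import combinations

def observe(x, p, rng):
    """Pass x_ij = x_i XOR x_j through an erasure channel: kept w.p. p, erased ('e') otherwise."""
    y = {}
    for i, j in combinations(range(len(x)), 2):
        y[(i, j)] = x[i] ^ x[j] if rng.random() < p else "e"
    return y

rng = random.Random(0)            # fixed seed, only for reproducibility
n, p = 200, 0.1
x = [rng.randint(0, 1) for _ in range(n)]
y = observe(x, p, rng)
num_observed = sum(v != "e" for v in y.values())
expected = p * n * (n - 1) / 2    # binom(n, 2) * p
print(num_observed, expected)
```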


Performance metrics & an optimization problem As in the communication
setting, we can think of two performance metrics. The first refers to a quantity that
we are interested in characterizing the limit on. That is, the number of pairwise
measurements that are observed, namely sample complexity. In this problem context,
the sample complexity would be concentrated around:

sample complexity $\longrightarrow \binom{n}{2} p$ as $n \to \infty$.

This is because of the WLLN. The second refers to a metric conventionally
employed in the context of inference problems. That is, the probability of error
defined as:

$P_e := P(\hat{x} \notin \{x, x \oplus 1\}).$

An error occurs when $\hat{x} \ne x$ and $\hat{x} \ne x \oplus 1$.

There must be a tradeoff relationship between the sample complexity and $P_e$.
The larger the sample complexity, the smaller $P_e$, and vice versa. Hence, as Shannon
did in the communication problem, we can formulate the following optimization
from which the tradeoff relationship can be characterized. Given $p$ and $n$,

$P_e^*(p, n) := \min_{\text{algorithm}} P_e.$ (3.1)

Remember the optimization problem that Shannon formulated earlier. It was a difficult
non-convex problem which remains open to this day. Here we see the same
thing. The above problem (3.1) is very difficult; hence, the exact error probability
$P_e^*(p, n)$ is still open.

Figure 3.5. Phase transition in community detection. If $S$ is above the minimal sample
complexity $S^* = \frac{n \ln n}{2}$, we can make $P_e$ arbitrarily close to 0 as $n$ tends to infinity. If
$S < S^*$, $P_e$ cannot be made arbitrarily close to 0 no matter what we do.


Phase transition But there is good news: as in the
communication setting, phase transition occurs w.r.t. the sample complexity in the
limit of $n$. If the sample complexity is above a threshold, one can make $P_e$ arbitrarily
close to 0 as $n \to \infty$; otherwise (i.e., if it is below the threshold), $P_e$ cannot be made
to approach 0 no matter what we do. In other words, there exists a sharp threshold
on the sample complexity which determines the boundary between possible vs.
impossible detection. This sharp threshold is called the minimum sample complexity
$S^*$. See Fig. 3.5.


It turns out the minimal sample complexity reads:

$S^* = \frac{n \ln n}{2}.$

Notice that $S^*$ is much smaller than the total number of possible pairs: $\binom{n}{2} \approx \frac{n^2}{2}$.
This result implies that the limit on the observation probability is

$p^* = \frac{S^*}{\binom{n}{2}} = \frac{\ln n}{n}$

where the second equality is because $\lim_{n \to \infty} \frac{\binom{n}{2}}{n^2/2} = 1$. Notice that $p^* = \frac{\ln n}{n}$
vanishes as $n \to \infty$, meaning that only a negligible fraction of pairwise
measurements is required for successful community detection.
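To get a feel for the scale, the sketch below plugs the user count quoted earlier (about 2.96 billion) into the formulas for $S^*$ and $p^*$. The function names are ours (hypothetical), and the printed numbers are produced by the script rather than quoted from the book.

```python
import math

def min_sample_complexity(n):
    """S* = n ln(n) / 2."""
    return n * math.log(n) / 2

def threshold_probability(n):
    """Asymptotic form p* = ln(n) / n."""
    return math.log(n) / n

n = 2_960_000_000                       # user count quoted in the text
total_pairs = n * (n - 1) // 2          # binom(n, 2)
s_star = min_sample_complexity(n)
print(f"total pairs    ~ {total_pairs:.3e}")
print(f"S*             ~ {s_star:.3e}")
print(f"p* = ln n / n  ~ {threshold_probability(n):.3e}")
print(f"S* / binom(n,2)~ {s_star / total_pairs:.3e}")
```

The last two printed values agree to many digits, which is the numerical counterpart of the limit used in the second equality.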


Look ahead In the next section, we will prove the achievability of the limit:

$p > \frac{\ln n}{n} \implies P_e \to 0$ as $n \to \infty$. (3.2)
3.2 Social Networks: Achievability Proof


Recap In the previous section, we embarked on Part III which focuses on the
application of information theory to data science. We discussed a specific application
known as community detection, which involves identifying similar communities.
Given two communities and a set of users with community memberships $x_i \in
\{0, 1\}$, the goal is to decode the community membership vector $x = [x_1, \ldots, x_n]$.
We used parity information, i.e., whether two users belong to the same community,
as the measurement information. However, since $x$ and its flipped version $x \oplus 1$ are
indistinguishable from the parities $y_{ij}$'s alone, we counted decoding either one as a
success event. We considered a setting where only a subset of pairwise measurements
$y_{ij}$'s are accessible and the pairs are chosen randomly without our control, motivated
by big data applications such as Meta's social networks. This led us to ask whether
there exists a fundamental limit on the number of pairwise measurements needed
to make community detection possible. We claimed that there is indeed a limit,
and we explored what this means precisely by translating it into a communication
problem, as shown in Fig. 3.6.

We introduced two performance metrics: (i) sample complexity (concentrated
around $\binom{n}{2} p$); and (ii) the probability of error $P_e := P(\hat{x} \notin \{x, x \oplus 1\})$. We then
claimed the minimum sample complexity (above which one can make $P_e \to 0$,
under which one cannot make $P_e \to 0$ no matter what we do) is:

$S^* = \frac{n \ln n}{2}$, i.e., $p^* = \frac{\ln n}{n}$.


Outline In this section, we will prove that $p^*$ is achievable:

$p > \frac{\ln n}{n} \implies P_e \to 0$ as $n \to \infty$. (3.3)

It consists of three parts. First, we will employ the maximum likelihood (ML)
decoding to derive the optimal decoder. We will then analyze the error probability
under the ML decoding rule. Using a couple of bounding techniques, we will derive
an upper bound of the error probability instead of attacking the exact probability
directly. Lastly we will show that as long as $p > \frac{\ln n}{n}$, the upper bound approaches
0 as the number $n$ of users tends to infinity, thereby proving the achievability.

[Figure 3.6 block labels: pairwise info (encoder); algorithm block (decoder).]

Figure 3.6. Translation of community detection into a communication problem.


Encoder The encoder converts $x$ into $x_{ij} := x_i \oplus x_j$. See Fig. 3.6. Unlike the
communication setting in which the encoder is subject to our design choice, here the
encoder is not of our design; it is given by the context. Let $X$ be the output matrix of size
n-by-n which contains $x_{ij}$'s as its entries. As in the communication setting, we call it
a codeword (encoder output). Obviously $X$ is symmetric as $x_{ij} = x_{ji}$. For instance,
when $x = (1000)$ and $n = 4$,

$X(x) = \begin{bmatrix} & \underline{1} & 1 & 1 \\ & & 0 & 0 \\ & & & 0 \\ & & & \end{bmatrix}.$ (3.4)

We omit diagonal components and symmetric counterparts as they can be trivially
inferred. We call the collection of codewords $X(x)$'s the codebook.
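The encoder is easy to express with NumPy broadcasting. The sketch below uses our own function name `codeword` (hypothetical); it builds the full n-by-n matrix and then extracts the upper-triangular entries, which are the ones shown in the codeword example for x = (1000). It also confirms that x and its flipped version map to the same codeword.

```python
import numpy as np

def codeword(x):
    """Return the n-by-n matrix X(x) with entries x_ij = x_i XOR x_j."""
    x = np.asarray(x, dtype=int)
    return x[:, None] ^ x[None, :]

X = codeword([1, 0, 0, 0])
upper = X[np.triu_indices(4, k=1)]      # entries with i < j, row by row
print(X)
print(upper)
```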


The optimal decoder Let $Y = [y_{ij}]$. The codebook is assumed to be known at
the decoder. This assumption is natural because the structure of pairwise
measurements is revealed. The decoder employs an optimal decision rule: the MAP
rule. In this setting, we have no knowledge of the statistics of $x$. So we consider the
worst-case scenario in which $x$ is uniformly distributed. Notice that the randomness
of $x$, quantified as its entropy $H(x)$, is maximized when $x_i$'s are i.i.d., each being
uniformly distributed. In this setting, the MAP rule becomes equivalent to the ML
decoder:

$\hat{x}_{\text{ML}} = \arg\max_{x} P(Y|X(x)).$

The calculation of $P(Y|X(x))$ is straightforward, being the same as the one in the
communication setting. For instance, suppose $n = 4$, $x = (0000)$, and $y_{ij}$'s are all
zeros except $(y_{13}, y_{14}) = (e, e)$:

$Y(x) = \begin{bmatrix} & \underline{0} & e & e \\ & & 0 & 0 \\ & & & 0 \\ & & & \end{bmatrix}.$ (3.5)

Then,

$P(Y|X(0000)) = (1 - p)^2 p^4.$

The exponent 2 of $(1 - p)$ is the number of erasures. This message (0000)
is compatible since the corresponding likelihood is not zero. On the other hand, for
$x = (1000)$,

$P(Y|X(1000)) = 0.$

Note that $x_{12} = 1$ (underscored in (3.4)) is different from $y_{12} = 0$ (underscored
in (3.5)), thus forcing the likelihood to be 0. This implies that the message $x =
(1000)$ can never be a solution, meaning that it is incompatible with $Y$. Hence, the
ML rule is summarized as follows.


1. Eliminate all the messages incompatible with Y.
2. If there is only one survivor, declare it as the correct message.


However, this procedure is not sufficient to describe the ML decoding rule. We
may have a different erasure pattern that confuses the rule. To see this clearly, consider
the following example. Suppose that $y_{ij}$'s are all zeros except $(y_{12}, y_{13}, y_{14}) =
(e, e, e)$:

$Y(x) = \begin{bmatrix} & e & e & e \\ & & 0 & 0 \\ & & & 0 \\ & & & \end{bmatrix}.$ (3.6)

Then,

$P(Y|X(0000)) = (1 - p)^3 p^3;$
$P(Y|X(0111)) = (1 - p)^3 p^3.$

The two patterns (0000, 0111) are compatible and the likelihood functions are
equal. In this case, the best we can do is to flip a coin, choosing one out of
the two in a random manner. This forms the last step of the ML decoding rule.


3. If there are multiple survivors, choose one randomly.
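The three steps can be sketched as a brute-force decoder for small n. This is only an illustration under our own naming (`compatible`, `ml_decode`) and with 0-based user indices; it enumerates all 2^n messages, which is feasible only for toy sizes. In this erasure model every compatible message has the same likelihood, so picking uniformly among survivors is consistent with the ML rule. The sketch assumes the observation Y is consistent with at least one message.

```python
import random
from itertools import combinations, product

def compatible(x, y):
    """True if every observed (non-erased) y_ij equals x_i XOR x_j."""
    return all(v == "e" or v == (x[i] ^ x[j]) for (i, j), v in y.items())

def ml_decode(y, n, rng=random):
    """Steps 1-3: drop incompatible messages, then pick a survivor uniformly."""
    survivors = [x for x in product((0, 1), repeat=n) if compatible(x, y)]
    return rng.choice(survivors), survivors

n = 4
pairs = list(combinations(range(n), 2))
# All parities are 0; erase (y_13, y_14), i.e. 0-based pairs (0, 2) and (0, 3).
y_two = {pq: ("e" if pq in [(0, 2), (0, 3)] else 0) for pq in pairs}
# Erase (y_12, y_13, y_14): the first user becomes disconnected from the rest.
y_three = {pq: ("e" if pq[0] == 0 else 0) for pq in pairs}
_, surv_two = ml_decode(y_two, n)
_, surv_three = ml_decode(y_three, n)
print(surv_two)
print(surv_three)
```

Survivors always come in pairs (a message and its flipped version), which is why the success event is defined up to a global flip.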


A setup for analysis of the error probability For the achievability
proof (3.3), we analyze the probability of error when using the ML decoder. Starting
with the definition of $P_e$ and using the total probability law, we get:

$P_e := P(\hat{x} \notin \{x, x \oplus 1\})$
$= \sum_{a} P(x = a) P(\hat{x} \notin \{a, a \oplus 1\} | x = a).$

For a fixed $a$, $P(\hat{x} \notin \{a, a \oplus 1\} | x = a)$ is a sole function of the likelihood which
depends only on erasure patterns of the $\binom{n}{2}$ independent channels. Also, the erasure
patterns are independent of the channel input affected by $a$. Therefore, the probability
does not depend on the value of $a$. Applying this to the above, we get:

$P_e = P(\hat{x} \notin \{0, 1\} | x = 0)$
$\le \sum_{a \notin \{0, 1\}} P(\hat{x} = a | x = 0)$ (3.7)

where 0 and 1 denote the all-zero and all-one vectors, and the inequality comes from the union bound.


Further upper-bounding Consider $P(\hat{x} = a | x = 0)$. To gain insights, consider
an example where $n = 4$ and $a = (1000)$. In this case, the error event implies
that $X(1000)$ must be compatible. A necessary condition for $X(1000)$ being compatible
under $x = (0000)$ is: $(y_{12}, y_{13}, y_{14}) = (e, e, e)$. For all $(i, j)$ entries whose
values are different between the two codewords $X(0000)$ and $X(1000)$ (that we call
distinguishable positions), erasures must occur; otherwise, (1000) cannot be compatible
as its corresponding likelihood would be 0. Hence, we get:

$P(\hat{x} = (1000) | x = 0) \le P(X(1000) \text{ compatible} | x = 0)$
$\le P((y_{12}, y_{13}, y_{14}) = (e, e, e) | x = 0)$
$= (1 - p)^3.$

The exponent 3 is the number of erasures that must occur in
those distinguishable positions.

A key to determine the above upper bound is the number of distinguishable
positions between $X(a)$ and $X(0)$. One can easily verify that the number of distinguishable
positions (w.r.t. $X(0)$) depends on the number of 1's in $a$. To see this,
consider an example of $a = (\underbrace{1 1 \cdots 1}_{k} 0 0 \cdots 0)$. In this case, in block form (with the
lower-left block omitted by symmetry),

$X(a) = \begin{bmatrix} \mathbf{0}_{k \times k} & \mathbf{1}_{k \times (n-k)} \\ & \mathbf{0}_{(n-k) \times (n-k)} \end{bmatrix}$

where $k$ denotes the number of 1's in $a$. Each of the first $k$ rows contains $(n - k)$
ones; hence, the total number of distinguishable positions w.r.t. $X(0)$ is:

# of distinguishable positions $= k(n - k)$.
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For a succinct description of the summation term in (3.7), we classify the 
instance a depending on the number of 1’s in a. To this end, we introduce: 


Ag := {alllalli = k} (3.8) 


where ljali := |ai| + |a2| + --- + |a,|. Using this notation, we can then 
express (3.7) as: 


n—1 
PAD > C-p? 


k=l ac A, 
n—1 
= >) Aa - pyr 
k=1 
n—1 j 
= ( A) (1— pyre (3.9) 
=] 


where the second last step follows from the fact that (7) (1 — p) OTP is symmetric 
around k = n/2. 
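The two counting facts behind (3.9), namely that every $a$ with $k$ ones has $k(n-k)$ distinguishable positions and that there are $\binom{n}{k}$ such vectors, can be checked by enumeration for a small $n$. The sketch below is only a sanity check; the helper names `codeword` and `num_distinguishable` are hypothetical.

```python
import itertools
from math import comb

import numpy as np


def codeword(a):
    """Codeword matrix X(a) whose (i, j) entry is a_i XOR a_j."""
    a = np.asarray(a)
    return a[:, None] ^ a[None, :]


def num_distinguishable(a):
    """Count positions i < j where X(a) differs from the all-zero X(0)."""
    return int(np.triu(codeword(a), k=1).sum())


n = 6
for k in range(n + 1):
    members = [a for a in itertools.product((0, 1), repeat=n) if sum(a) == k]
    counts = {num_distinguishable(a) for a in members}
    # Each line shows k, |A_k|, C(n, k), the observed counts, and k(n - k).
    print(k, len(members), comb(n, k), counts, k * (n - k))
```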


The final step of the achievability proof Since we intend to prove the achievability when $p > \frac{\ln n}{n}$, we focus on the regime $\lambda > 1$, where $\lambda$ is defined such that $p = \lambda \frac{\ln n}{n}$. Then, it suffices to show that for $\lambda > 1$,

$$
\sum_{k=1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)} \to 0 \quad \text{as } n \to \infty. \tag{3.10}
$$

In this setting of $p = \lambda \frac{\ln n}{n}$, $p$ is arbitrarily close to 0 in the limit of $n$. This motivates us to employ the following upper bound on $1-p$ (which is very tight in this regime):

$$
1 - p \le e^{-p}. \tag{3.11}
$$

Check the proof in Prob 7.1. When $n$ is large, the following bound is good enough to prove the achievability:

$$
\binom{n}{k} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!} \le \frac{n^k}{k!} \le n^k. \tag{3.12}
$$

Applying the bounds (3.11) and (3.12) to (3.9), we obtain:

$$
\begin{aligned}
P_e &\le 2\sum_{k=1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)} \\
&\le 2\sum_{k=1}^{n/2} n^{k} e^{-pk(n-k)} \\
&\overset{(a)}{=} 2\sum_{k=1}^{n/2} e^{k \ln n - \lambda k \left(1 - \frac{k}{n}\right) \ln n} \\
&= 2\sum_{k=1}^{n/2} e^{-k\left(\lambda\left(1 - \frac{k}{n}\right) - 1\right) \ln n}
\end{aligned}
\tag{3.13}
$$

where $(a)$ follows from $n^k = e^{k \ln n}$ and $p := \lambda \frac{\ln n}{n}$.
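For the range $1 \le k \le \frac{n}{2}$, the factor $1 - \frac{k}{n}$ in (3.13) is minimized at $k = \frac{n}{2}$, where it equals $\frac{1}{2}$. Applying this to the above, we get:

$$
P_e \le 2\sum_{k=1}^{n/2} e^{-k\left(\frac{\lambda}{2} - 1\right) \ln n}. \tag{3.14}
$$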


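Case I: $\lambda > 2$: If $\lambda > 2$, we can apply the well-known summation formula for the geometric series with common ratio $e^{-\left(\frac{\lambda}{2}-1\right)\ln n}$, thus obtaining:

$$
P_e \le 2\sum_{k=1}^{n/2} e^{-k\left(\frac{\lambda}{2} - 1\right) \ln n} \le \frac{2\, e^{-\left(\frac{\lambda}{2}-1\right)\ln n}}{1 - e^{-\left(\frac{\lambda}{2}-1\right)\ln n}} \to 0 \quad \text{as } n \to \infty.
$$

Hence, we can make $P_e \to 0$ for the case of $\lambda > 2$.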
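Case II: $1 < \lambda \le 2$: Observe in the last step of (3.13) that

$$
\lambda\left(1 - \frac{k}{n}\right) - 1 = 0 \implies k = \left(1 - \frac{1}{\lambda}\right) n
$$

and that $\lambda\left(1 - \frac{k}{n}\right) - 1$ is a decreasing function of $k$. See Fig. 3.7.

Figure 3.7. Behavior of $\lambda\left(1 - \frac{k}{n}\right) - 1$ in $k$.

This together with a choice of $\alpha < 1 - \frac{1}{\lambda}$ gives:

$$
\lambda\left(1 - \frac{k}{n}\right) - 1 > 0 \quad \text{when } k \le \alpha n.
$$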


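Applying this to (3.13), we get:

$$
\begin{aligned}
P_e &\le 2\sum_{k=1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)} \\
&\le 2\sum_{k=1}^{\alpha n} e^{-k\left(\lambda\left(1 - \frac{k}{n}\right) - 1\right) \ln n} + 2\sum_{k=\alpha n+1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)}.
\end{aligned}
$$

For $k \le \alpha n$, the exponent satisfies $\lambda\left(1 - \frac{k}{n}\right) - 1 \ge \lambda(1-\alpha) - 1 > 0$. Hence, again applying the summation formula of the geometric series, the first term in the last step vanishes as $n \to \infty$, and we obtain:

$$
\begin{aligned}
P_e &\lesssim 2\sum_{k=\alpha n+1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)} \\
&\overset{(a)}{\le} 2(1-p)^{\alpha(1-\alpha)n^2} \sum_{k=\alpha n+1}^{n/2} \binom{n}{k} \\
&\overset{(b)}{\le} 2(1-p)^{\alpha(1-\alpha)n^2}\, 2^{n} \\
&\overset{(c)}{\le} 2\, e^{-\lambda\alpha(1-\alpha) n \ln n} \cdot e^{n \ln 2}
\end{aligned}
$$

where $(a)$ is due to the fact that $k(n-k) \ge \alpha(1-\alpha)n^2$ for $\alpha n \le k \le \frac{n}{2}$ (with equality at $k = \alpha n$); $(b)$ comes from the binomial theorem ($\sum_{k=0}^{n} \binom{n}{k} = 2^n$); and $(c)$ comes from the fact that $1 - p \le e^{-p}$ and $p := \lambda \frac{\ln n}{n}$. The last term in the above tends to 0 as $n \to \infty$, because $\lambda\alpha(1-\alpha) n \ln n$ grows much faster than $n \ln 2$. This implies $P_e \to 0$, which completes the achievability proof (3.3).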
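To see the union bound at work numerically, one can evaluate $\sum_{k=1}^{n-1}\binom{n}{k}(1-p)^{k(n-k)}$ with $p = \lambda\frac{\ln n}{n}$ for growing $n$. The sketch below works in the log domain to avoid overflow in the binomial coefficients; the function name `log_union_bound` and the chosen values of $\lambda$ and $n$ are illustrative assumptions.

```python
import math


def log_union_bound(n, lam):
    """Natural log of sum_{k=1}^{n-1} C(n,k) (1-p)^{k(n-k)}, p = lam*ln(n)/n."""
    p = lam * math.log(n) / n
    log_q = math.log1p(-p)  # log(1 - p), accurate for small p
    terms = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * (n - k) * log_q
        for k in range(1, n)
    ]
    # Log-sum-exp: factor out the largest term before exponentiating.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


for lam in (1.5, 2.5):
    for n in (100, 1000, 10000):
        print(lam, n, log_union_bound(n, lam))
```

For both values of $\lambda > 1$ the logarithm of the bound decreases as $n$ grows, in line with $P_e \to 0$.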


Look ahead Using the ML decoding rule, we proved the achievability of the community detection limit (3.3). It turns out that the converse also holds: the condition $p > \frac{\ln n}{n}$ is necessary for reliable detection. In the next section, we will prove the converse.

3.3 Social Networks: Converse Proof


Recap In the previous section, we proved the achievability of the limit on observation probability $p$ for community detection:

$$
p > \frac{\ln n}{n} \implies P_e \to 0 \text{ as } n \to \infty.
$$

Outline In this section, we will prove that the condition $p > \frac{\ln n}{n}$ is also necessary for reliable community detection, meaning the converse (the other way around) holds:

$$
p < \frac{\ln n}{n} \implies P_e \nrightarrow 0.
$$

Proof strategy The converse proof relies on a lower bound of $P_e$ which does not vanish under the condition $p < \frac{\ln n}{n}$. In Part II, we learned about one important inequality that played a significant role in deriving such a lower bound: Fano's inequality. However, Fano's inequality does not yield a good enough lower bound in the context of community detection; check this in Prob 7.4. Hence, we will take a different approach.

The different approach builds upon another important concept that has been 
extensively employed in the graph theory literature. That is, graph connectivity. We 
say that a graph is connected if there exists a path (i.e., a sequence of connected 
edges) between any pair of two nodes. Otherwise, it is said to be disconnected. See 
Fig. 3.8 for a pictorial illustration. 
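The definition of connectivity translates directly into a graph search. The following is a minimal sketch using breadth-first search; the function name `is_connected` and the two five-node edge lists are hypothetical examples, not the graphs drawn in Fig. 3.8.

```python
from collections import deque


def is_connected(n, edges):
    """Breadth-first search from node 0; connected iff every node is reached."""
    neighbors = {v: [] for v in range(n)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n


# Five nodes with 0-based labels. In the second graph node 0 has no edge,
# so it is isolated and the graph is disconnected.
connected_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
disconnected_edges = [(1, 2), (2, 3), (3, 4)]
print(is_connected(5, connected_edges))
print(is_connected(5, disconnected_edges))
```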

Graph connectivity has a close relationship with an error event of community detection. Suppose that an edge in the graph indicates that a pairwise measurement of the associated nodes is obtained. Then, graph disconnectivity implies that some node(s) share no measurement with the rest of the graph (e.g., the isolated node 1 in the example illustrated on the right side of Fig. 3.8). In this case, there is no way to decode $x$ even up to a global shift. Note in the example that there are four possible candidates for a solution. For instance, suppose $(y_{23}, y_{34}, y_{45}) = (0, 0, 0)$. Then, the four candidates are: (i) $x = [0,0,0,0,0]$; (ii) $x = [0,1,1,1,1]$; (iii) $x = [1,0,0,0,0]$; and (iv) $x = [1,1,1,1,1]$. Obviously this does not ensure successful community detection, hence $P_e \nrightarrow 0$. Therefore, it suffices to show that

$$
p < \frac{\ln n}{n} \implies \text{graph is disconnected, i.e., } P(\text{connected}) \to 0.
$$

Figure 3.8. Graph connectivity: (Left) An example of a connected graph; (Right) An example of a disconnected graph.

Figure 3.9. A necessary condition for graph connectivity.

An upper bound of connectivity probability P(connected) Since we intend to show $P(\text{connected}) \to 0$, it suffices to prove that an upper bound on it tends to 0 as $n \to \infty$. Graph connectivity has nothing to do with the value of $x$. Hence, without loss of generality, one can assume that the ground truth is $x = \mathbf{0}$. This then gives:

$$
P(\text{connected}) = P(\text{connected} \mid x = \mathbf{0}).
$$

The event of a graph being connected implies that node 1 is connected with at least one other node; otherwise, node 1 is isolated. This means that there exists at least one observation in the first row of the received signal matrix $Y$. In the example of Fig. 3.9, $(y_{12}, y_{15})$ are observed as $(0, 0)$. This observation is compatible with the all-zero ground truth vector. On the other hand, the codeword $X(10\cdots0)$ w.r.t. the message $(10\cdots0)$ is incompatible with $Y$, since the revealed components in $Y$ contradict the corresponding components in $X(10\cdots0)$: in the example of Fig. 3.9, $(y_{12}, y_{15}) = (0, 0)$ does not match $(x_{12}, x_{15}) = (1, 1)$. Similarly, $X(010\cdots0), \ldots, X(0\cdots01)$ are all incompatible with $Y$. Using this and the set $A_1 := \{a : \|a\|_1 = 1\}$, we can bound $P(\text{connected})$ as:

$$
\begin{aligned}
P(\text{connected}) &= P(\text{connected} \mid x = \mathbf{0}) \\
&\le P\Big(\bigcap_{a \in A_1} \{X(a) \text{ incomp.}\} \,\Big|\, x = \mathbf{0}\Big).
\end{aligned}
\tag{3.15}
$$

A key upper-bounding technique One key observation is that the computation of the above probability is greatly simplified when the associated events are independent. Remember that $P(A \cap B) = P(A)P(B)$ for independent events $A$ and $B$. Also, the more independent events we have, the tighter the bound we get. This motivates us to search for as many independent events (among the $n$ events) as possible.

To this end, we examine the functional relationship between $X(a)$ and erasure patterns. Only the first-row entries of $X(10\cdots0)$ differ from those of $X(\mathbf{0})$. This implies that whether $X(10\cdots0)$ is incompatible depends solely on the pairwise measurements $y_{1j}$'s in those distinguishable positions. Similarly, the event of $X(010\cdots0)$ being incompatible depends on the $y_{2j}$'s in the second row. Here the symmetry of the matrix gives $y_{12} = y_{21}$, hence the two events may be dependent.

However, the dependency can be removed in certain situations. Suppose the following event occurs: $y_{12} = e$. Then, the two events share no overlapping positions, since the overlapping $(1,2)$ entry is now erased. Hence, given $y_{12} = e$, the two events become independent. Similarly, given $y_{ij} = e$ for all $i, j \in \{1, 2, 3\}$,

$$
\{X(10\cdots0) \text{ incomp.}\} \perp \{X(010\cdots0) \text{ incomp.}\} \perp \{X(0010\cdots0) \text{ incomp.}\}.
$$

This enables us to identify a general erasure pattern that makes multiple events (say $L$ events) independent:

$$
y_{ij} = e \;\; \forall i, j \in \{1, 2, \ldots, L\} \implies \{X(10\cdots0) \text{ incomp.}\} \perp \{X(010\cdots0) \text{ incomp.}\} \perp \cdots \perp \{X(0\cdots\underbrace{1}_{L\text{th position}}\cdots0) \text{ incomp.}\}.
$$

In terms of the graph, the number $L$ refers to the number of nodes that are locally disconnected, i.e., nodes with no edge among themselves. In the example of Fig. 3.10, we have no observation for any pair of nodes 1, 2, 3, 4, 5. So $L \ge 5$ in this case.

This motivates us to find the maximum number of nodes that are locally disconnected. It reveals as many independent events as possible, thus leading to a tight upper bound on the connectivity probability. To this end, consider the first $m$ nodes in the graph. We will choose $m$ such that $L$ is maximized and $m$ tends to $\infty$ as $n \to \infty$. Then, as per the WLLN, the number of edges in the subgraph consisting of the $m$ nodes is concentrated around $\binom{m}{2}p$ w.h.p.

Figure 3.10. An example in which the number of nodes locally disconnected is greater than or equal to 5 (there is no edge among the 5 nodes).


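Using the fact that one edge is associated with two nodes, the number of locally disconnected nodes is at least $m - 2\binom{m}{2}p$:

$$
L \ge \left\lfloor m - 2\binom{m}{2}p \right\rfloor.
$$

Let $p = \lambda \frac{\ln n}{n}$ for some $\lambda < 1$. Then,

$$
L \ge \left\lfloor m - \lambda m^2 \frac{\ln n}{n} \right\rfloor.
$$

In an effort to maximize the above bound, we choose $m$ such that the first term $m$ and the second term $\lambda m^2 \frac{\ln n}{n}$ in the bound are of the same order. We make a particular choice for such $m$: $m = \left\lfloor \frac{n}{2 \ln n} \right\rfloor$, thus obtaining:

$$
L \ge \left\lfloor \left(\frac{1}{2} - \frac{\lambda}{4}\right) \frac{n}{\ln n} \right\rfloor = \left\lfloor \frac{\alpha n}{\ln n} \right\rfloor
$$

where $\alpha := \frac{1}{2} - \frac{\lambda}{4}$. We then reorder node indices such that the locally disconnected nodes are numbered as $1, 2, \ldots, \left\lfloor \frac{\alpha n}{\ln n} \right\rfloor$. Let $T$ be such an event:

$$
T := \left\{ L \ge \left\lfloor \frac{\alpha n}{\ln n} \right\rfloor \right\}.
$$

We call it a typical event as it happens w.h.p.:

$$
P(T) \to 1 \text{ as } n \to \infty.
$$

Given the typical event $T$,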


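$$
\{X(10\cdots0) \text{ incomp.}\} \perp \{X(010\cdots0) \text{ incomp.}\} \perp \cdots \perp \{X(0\cdots\underbrace{1}_{\lfloor \frac{\alpha n}{\ln n} \rfloor\text{th position}}\cdots0) \text{ incomp.}\}. \tag{3.16}
$$

Applying this to (3.15) together with the total probability law, we get:

$$
\begin{aligned}
P(\text{connected}) &= P(\text{connected} \mid x = \mathbf{0}) \\
&\le P\Big(\bigcap_{a \in A_1} \{X(a) \text{ incomp.}\} \,\Big|\, x = \mathbf{0}\Big) \\
&= P\Big(\bigcap_{a \in A_1} \{X(a) \text{ incomp.}\},\, T \,\Big|\, x = \mathbf{0}\Big) + P\Big(\bigcap_{a \in A_1} \{X(a) \text{ incomp.}\},\, T^c \,\Big|\, x = \mathbf{0}\Big) \\
&\overset{(a)}{\approx} P\Big(\bigcap_{a \in A_1} \{X(a) \text{ incomp.}\} \,\Big|\, x = \mathbf{0}, T\Big) \\
&\overset{(b)}{\le} \prod_{a \in B_1} P\left(X(a) \text{ incomp.} \mid x = \mathbf{0}, T\right) \\
&\overset{(c)}{=} P\left(X(10\cdots0) \text{ incomp.} \mid x = \mathbf{0}, T\right)^{\lfloor \frac{\alpha n}{\ln n} \rfloor}
\end{aligned}
\tag{3.17}
$$

where $(a)$ follows from $P(T) \to 1$ and $P(T^c) \to 0$; $(b)$ comes from (3.16) and the definition $B_1 := \{b : \|b\|_1 = 1,\ b_i = 1 \text{ for some } i \in \{1, 2, \ldots, \lfloor \frac{\alpha n}{\ln n} \rfloor\}\}$; and $(c)$ is by symmetry.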


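The final step The event of $X(10\cdots0)$ being incompatible given $(x = \mathbf{0}, T)$ implies that there exists at least one observation among the last $n - \left\lfloor \frac{\alpha n}{\ln n} \right\rfloor$ components in the first row of $Y$. Hence,

$$
\begin{aligned}
P\left(X(10\cdots0) \text{ incomp.} \mid x = \mathbf{0}, T\right) &\le 1 - (1-p)^{n - \lfloor \frac{\alpha n}{\ln n} \rfloor} \\
&\le 1 - (1-p)^{n} \\
&\overset{(a)}{\le} e^{-(1-p)^{n}}
\end{aligned}
$$

where $(a)$ is due to the fact that $1 - x \le e^{-x}$ for $x \ge 0$. Applying this to (3.17), we get:

$$
\begin{aligned}
P(\text{connected}) &\lesssim P\left(X(10\cdots0) \text{ incomp.} \mid x = \mathbf{0}, T\right)^{\lfloor \frac{\alpha n}{\ln n} \rfloor} \\
&\le \exp\left(-(1-p)^{n} \left\lfloor \frac{\alpha n}{\ln n} \right\rfloor\right) \\
&\overset{(a)}{\approx} \exp\left(-e^{-pn}\, \frac{\alpha n}{\ln n}\right) \\
&= \exp\left(-\alpha e^{(1-\lambda) \ln n - \ln \ln n}\right)
\end{aligned}
$$

where $(a)$ comes from the fact that $(1-p)^n \approx e^{-pn}$ for sufficiently large $n$ and small $p = \lambda \frac{\ln n}{n}$ (which is our case) and the fact that $\left\lfloor \frac{\alpha n}{\ln n} \right\rfloor \approx \frac{\alpha n}{\ln n}$ for large $n$. If $\lambda < 1$, the exponent $(1-\lambda)\ln n - \ln \ln n$ tends to $\infty$, so the upper bound goes to 0 as $n \to \infty$. This completes the proof.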
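As a quick numerical illustration of the final bound, the sketch below evaluates $\exp\left(-\alpha e^{(1-\lambda)\ln n - \ln\ln n}\right)$ with $\alpha = \frac{1}{2} - \frac{\lambda}{4}$ for a few values of $\lambda < 1$. The function name `connectivity_upper_bound` and the chosen values of $\lambda$ and $n$ are assumptions made for illustration only.

```python
import math


def connectivity_upper_bound(n, lam):
    """Evaluate exp(-alpha * exp((1 - lam) ln n - ln ln n)), alpha = 1/2 - lam/4."""
    alpha = 0.5 - lam / 4.0
    exponent = (1.0 - lam) * math.log(n) - math.log(math.log(n))
    return math.exp(-alpha * math.exp(exponent))


for lam in (0.3, 0.6):
    values = [connectivity_upper_bound(10 ** e, lam) for e in (2, 4, 8)]
    print(lam, values)
```

For these values of $\lambda$ the bound shrinks toward 0 as $n$ grows, and it does so more slowly as $\lambda$ moves closer to 1.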


Look ahead We have thus far proved the achievability and converse of the community detection limit. Remember that the achievable scheme is the ML decoding rule:

$$
\hat{x}_{\text{ML}} = \arg\max_{x} P(Y \mid X(x)).
$$

A practical issue arises with the implementation of the ML rule due to its prohibitive complexity. The number of likelihood computations required for the rule is $2^n$, which grows exponentially with the number of users $n$, and $n$ is typically very large in practice. However, there is another algorithm that is much more efficient and provides nearly the same performance as the ML rule. In the following section, we will examine this efficient algorithm.

3.4 An Efficient Algorithm and Python Implementation


Recap In Section 3.2, we employed the ML decoding rule to prove the achievability of the limit $p^* = \frac{\ln n}{n}$ for community detection. Recall the ML decoding rule:

$$
\hat{x}_{\text{ML}} = \arg\max_{x} P(Y \mid X(x))
$$

where $x = [x_1, \ldots, x_n]^T$ indicates the community membership vector; $X(x)$ denotes a codeword matrix taking $x_{ij} = x_i \oplus x_j$ as the $(i,j)$ entry; and $Y$ is an observation matrix with entries $y_{ij}$ ($y_{ij} = x_{ij}$ w.p. $p$ and $e$ otherwise). One critical issue with the ML rule is its complexity. Since $x$ takes one of $2^n$ possible patterns, the ML rule requires $2^n$ likelihood computations, a number that grows exponentially with $n$. Hence, it is crucial to develop a computationally efficient algorithm that possibly yields the same performance as the ML rule. Indeed, efficient algorithms have been developed that achieve the optimal ML performance.


Outline In this section, we will examine one such efficient algorithm for commu- 
nity detection. Although the optimal algorithm is quite complex, we will focus on 
its simpler version that achieves sub-optimal performance while still including the 
main components of the algorithm. We will cover four main points in this section. 
First, we will introduce the adjacency matrix, which plays a fundamental role in 
the algorithm. Second, we will explain how the algorithm works, including the 
process of finding the principal eigenvector of the adjacency matrix. Third, we will 
explore an efficient method of computing the principal eigenvector, known as the 
power method. Finally, we will provide a Python implementation of the spectral 
algorithm. 


Adjacency matrix Since the number $n$ of users is often huge, the pairwise measurements constitute big data even though only part of them are observed. In data science, there is a useful entity that represents such big data in a succinct way: the adjacency matrix (Chartrand, 1977). The adjacency matrix, say $A$, is an equivalent representation of the pairwise data in which each row and each column corresponds to a user. Each entry, say $a_{ij}$, indicates whether users $i$ and $j$ are in the same community. Precisely, $a_{ij} = +1$ means that the two users are in the same community; $a_{ij} = -1$ indicates the opposite; and $a_{ij} = 0$ denotes no measurement:

$$
a_{ij} = \begin{cases} 1 - 2y_{ij}, & \text{w.p. } p; \\ 0, & \text{otherwise } (y_{ij} = e). \end{cases} \tag{3.18}
$$

For instance, when $x = [1, 0, 0, 1]^T$, we might have:

$$
A = \begin{bmatrix} +1 & -1 & 0 & 0 \\ -1 & +1 & +1 & -1 \\ 0 & +1 & +1 & -1 \\ 0 & -1 & -1 & +1 \end{bmatrix} \tag{3.19}
$$

where we have erasures for $(y_{13}, y_{14})$.
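The example (3.19) can be reproduced with a few lines of NumPy. In this sketch the erased pairs are passed in explicitly rather than drawn at random, so the output is deterministic; the function name `adjacency_matrix` and its `erased_pairs` argument are hypothetical.

```python
import numpy as np


def adjacency_matrix(x, erased_pairs=()):
    """Build A with a_ij = 1 - 2*(x_i XOR x_j) if observed, and 0 if erased."""
    x = np.asarray(x)
    X = x[:, None] ^ x[None, :]      # codeword matrix, x_ij = x_i XOR x_j
    A = 1 - 2 * X                    # +1: same community, -1: different
    for i, j in erased_pairs:        # erase both (i, j) and (j, i)
        A[i, j] = 0
        A[j, i] = 0
    return A


x = [1, 0, 0, 1]
# Pairs (1,3) and (1,4) in the text's 1-based indexing are (0,2) and (0,3) here.
A = adjacency_matrix(x, erased_pairs=[(0, 2), (0, 3)])
print(A)
```

Erasing both $(i,j)$ and $(j,i)$ keeps $A$ symmetric, which matches the symmetric structure of (3.19).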
In the full measurement setting ($p = 1$), one can make an observation that gives significant insight into algorithms. When $p = 1$, we would obtain:

$$
A = \begin{bmatrix} +1 & -1 & -1 & +1 \\ -1 & +1 & +1 & -1 \\ -1 & +1 & +1 & -1 \\ +1 & -1 & -1 & +1 \end{bmatrix}. \tag{3.20}
$$

Note that the rank of $A$ is 1, i.e., each row is a scalar multiple of any other row. Can we extract the community pattern from this rank-1 matrix? It turns out the answer is yes, and this forms the basis of the spectral algorithm that we will investigate in the sequel.


Spectral algorithm (Shen et al., 2011) Using the fact that the rank of $A$ in (3.20) is 1, we can easily derive its eigenvector:

$$
v = \begin{bmatrix} +1 \\ -1 \\ -1 \\ +1 \end{bmatrix}. \tag{3.21}
$$

Check that $Av = 4v$ indeed. Here the $+1$ (or $-1$) entries of $v$ tell us the community memberships of the users. Therefore, in the ideal situation where every pair is sampled, the principal eigenvector (the sole eigenvector with a non-zero eigenvalue) recovers the communities. This approach, taking the adjacency matrix and computing its principal eigenvector, is called the spectral algorithm.
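The claims about the fully observed case can be verified numerically. The sketch below builds the matrix in (3.20) as an outer product of the $\pm 1$ labels $1 - 2x$, checks its rank and the eigenvalue relation, and reads the communities off the principal eigenvector; using `np.linalg.eigh` here is a convenience for this tiny example, not the method advocated for large $n$.

```python
import numpy as np

x = np.array([1, 0, 0, 1])
s = 1 - 2 * x                      # +/-1 labels of the four users
A = np.outer(s, s)                 # equals (3.20) when every pair is observed

v = np.array([1, -1, -1, 1])       # the eigenvector in (3.21)
print(np.linalg.matrix_rank(A))
print(A @ v)

# eigh returns eigenvalues in ascending order, so the last column of
# eigvecs is the principal eigenvector (defined up to a global sign).
eigvals, eigvecs = np.linalg.eigh(A)
labels = np.sign(eigvecs[:, -1])
print(eigvals[-1], labels)
```

The recovered labels agree with $v$ up to a global sign flip, which corresponds to swapping the names of the two communities.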

What about the partial measurement case $p < 1$? In this case, it is not clear whether the principal eigenvector is able to return the community memberships. Also, the components of the eigenvector may not necessarily take $+1$ (or $-1$). To address the second issue, we may take a thresholded eigenvector, say $v_{\text{th}}$, whose $i$th entry takes the sign of $v_i$:

$$
v_{\text{th},i} = \begin{cases} +1, & v_i > 0; \\ -1, & \text{otherwise}. \end{cases} \tag{3.22}
$$

As long as $p$ is big enough, $v_{\text{th}}$ returns the ground truth of communities as $n$ tends to infinity. Given the focus of this book, we will not analyze how big $p$ must be for successful recovery by the spectral algorithm. Instead we will later provide empirical simulations via Python to demonstrate that for a large value of $n$, $v_{\text{th}}$ indeed approaches the ground truth as $p$ increases.


Power method (Golub and Van Loan, 2013) Another technical question arises when it comes to computing the principal eigenvector. What if the adjacency matrix is very large? Remember that the order of $n$ is around $10^9$ in Meta's social networks. A naive way of computing the eigenvector based on eigenvalue decomposition requires a complexity of around $n^3$. Hence, this way is prohibitive. Fortunately, there is a very efficient and useful way of computing the principal eigenvector: the power method. The method is well known and popular in the data science literature.

Prior to describing how it works in detail, let us make important observations that naturally lead to the method. Suppose that the adjacency matrix $A \in \mathbb{R}^{n \times n}$ has $m$ eigenvalues $\lambda_i$'s and eigenvectors $v_i$'s:

$$
A := \lambda_1 v_1 v_1^T + \lambda_2 v_2 v_2^T + \cdots + \lambda_m v_m v_m^T
$$

where $\lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_m$ and the $v_i$'s are orthonormal: $v_i^T v_j = \mathbf{1}\{i = j\}$. By definition, $(\lambda_1, v_1)$ are the principal eigenvalue and eigenvector respectively. Let $v \in \mathbb{R}^n$ be an arbitrary non-zero vector such that $v_1^T v \neq 0$. Then, $Av$ can be expressed as:

$$
\begin{aligned}
Av &= \sum_{i=1}^{m} \lambda_i v_i v_i^T v \\
&= \sum_{i=1}^{m} \lambda_i (v_i^T v) v_i
\end{aligned}
\tag{3.23}
$$

where the second equality is due to the fact that $v_i^T v$ is a scalar.
The first term $\lambda_1 (v_1^T v) v_1$ in the above summation forms a major contribution, since $\lambda_1$ is the largest. This major effect becomes more dominant when we multiply the adjacency matrix to the resulting vector $Av$. To see this, consider:

$$
\begin{aligned}
A^2 v &= A(Av) \\
&= \left(\sum_{i=1}^{m} \lambda_i v_i v_i^T\right) \sum_{j=1}^{m} \lambda_j (v_j^T v) v_j \\
&\overset{(a)}{=} \sum_{i=1}^{m} \lambda_i^2 (v_i^T v)(v_i^T v_i) v_i + \sum_{i \neq j} \lambda_i \lambda_j (v_j^T v)(v_i^T v_j) v_i \\
&\overset{(b)}{=} \sum_{i=1}^{m} \lambda_i^2 (v_i^T v) v_i
\end{aligned}
\tag{3.24}
$$

where $(a)$ comes from the fact that $v_i^T v_j$ is a scalar; and $(b)$ follows because the $v_i$'s are orthonormal vectors, i.e., $v_i^T v_i = 1$ and $v_i^T v_j = 0$ for $i \neq j$. The distinction in (3.24) relative to $Av$ in (3.23) is that we have $\lambda_i^2$ instead of $\lambda_i$. So the contribution from the principal component is more significant relative to the other components. Iterating this process (multiplying $A$ to the resulting vector iteratively), we get:

$$
A^k v = \sum_{i=1}^{m} \lambda_i^k (v_i^T v) v_i.
$$

In the limit of $k$,

$$
\frac{A^k v}{\lambda_1^k (v_1^T v)} = \sum_{i=1}^{m} \left(\frac{\lambda_i}{\lambda_1}\right)^k \frac{v_i^T v}{v_1^T v}\, v_i \to v_1 \quad \text{as } k \to \infty.
$$

This implies that by iterating the following process (multiplying by $A$ and then normalizing the resulting vector), the normalized vector converges to the principal eigenvector (up to sign):

$$
\frac{A^k v}{\|A^k v\|_2} \to v_1 \quad \text{as } k \to \infty.
$$

This observation leads to the power method:

1. Choose a random vector $v$ and set $v^{(0)} = v$ and $t = 0$.
2. Compute $v^{(t+1)} = \frac{A v^{(t)}}{\|A v^{(t)}\|_2}$ and increase $t$ by 1.
3. Iterate Step 2 until converged, e.g., $\|v^{(t)} - v^{(t-1)}\|_2 < \epsilon = 10^{-5}$.

The power method requires multiple (say $k$) matrix-vector multiplications, each having the complexity of $n^2$ multiplications. Hence, the complexity of the power method is still on the order of $n^2$, as long as the number $k$ of iterations is not so large relative to $n$ (this is often the case in practice). This is much smaller than the complexity $n^3$ of the eigenvalue decomposition, especially when $n$ is very large. Due to this computational benefit, the power method is widely employed as an efficient algorithm for finding the principal eigenvector in many applications.
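The convergence claim can be checked on a small matrix whose spectrum is known by construction. The sketch below is an illustration under assumptions: the function name `power_iteration`, the fixed iteration count, and the test spectrum $5 > 2 > 1$ are all hypothetical choices, and a fixed number of iterations replaces the stopping rule in Step 3 for simplicity.

```python
import numpy as np


def power_iteration(A, num_iters=200, seed=0):
    """Repeat v <- A v / ||A v||_2 and return the resulting unit vector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = A @ v                      # one matrix-vector product per iteration
        v = w / np.linalg.norm(w)
    return v


# Symmetric test matrix with eigenvalues 5 > 2 > 1 and known eigenvectors:
# the columns of Q are orthonormal, and Q[:, 0] is the principal eigenvector.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q @ np.diag([5.0, 2.0, 1.0]) @ Q.T

v_power = power_iteration(A)
v_true = Q[:, 0]
# Eigenvectors are defined up to sign, so compare |<v_power, v_true>| with 1.
print(abs(v_power @ v_true))
```

The non-principal components shrink by a factor of roughly $\lambda_2/\lambda_1 = 0.4$ per iteration, so a few dozen iterations already suffice in this example.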


Python implementation of the spectral algorithm We implement the spectral algorithm via Python. We first generate the community memberships of $n$ users.


from scipy.stats import bernoulli
import numpy as np

n = 8 # number of users
Bern = bernoulli(0.5)
# Generate n community memberships
x = Bern.rvs(n)
print(x)


[0 0 0 0 1 0 1 0]


We then construct the codeword matrix X(x), from which the random pairwise measurements are drawn.


# Construct the codebook
X = np.zeros((n,n))
for i in range(len(x)):
    for j in range(i,len(x)):
        # Compute xij = xi + xj (modulo 2)
        X[i,j] = (x[i]+x[j]) % 2
        # Symmetric component
        X[j,i] = X[i,j]
print(X)

[[0. 0. 0. 0. 1. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 1. 0.]
 [1. 1. 1. 1. 0. 1. 0. 1.]
 [0. 0. 0. 0. 1. 0. 1. 0.]
 [1. 1. 1. 1. 0. 1. 0. 1.]
 [0. 0. 0. 0. 1. 0. 1. 0.]]


Next we compute the adjacency matrix.

# observation probability
p = 0.8
obs_bern = bernoulli(p)
# Construct an n-by-n mask matrix:
# entry = 1 (observed); 0 (otherwise)
mask_matrix = obs_bern.rvs((n,n))
# Construct the adjacency matrix
A = (1-2*X)*mask_matrix

print(1-2*X)
print(mask_matrix)
print(A)
[Printed output: the three 8-by-8 matrices 1-2*X, mask_matrix, and A.]


We run the power method to compute the principal eigenvector of A.

def power_method(A, eps=1e-5):
    # A computationally efficient algorithm
    # for finding the principal eigenvector
    # Choose a random vector
    v = np.random.randn(A.shape[0])
    # normalization
    v = v/np.linalg.norm(v)
    prev_v = np.zeros(len(v))
    t = 0
    # Iterate until two consecutive iterates are within eps
    while np.linalg.norm(prev_v-v) > eps:
        prev_v = v
        v = np.array(np.dot(A,v)).reshape(-1)
        v = v/np.linalg.norm(v)
        t += 1
    print("Terminated after %s iterations"%t)
    return v


v1 = power_method(A)
print(v1)
print(np.sign(v1))
print(1-2*x)

Terminated after 8 iterations
[ 0.25389707  0.29847768  0.35720613  0.34957476 -0.40941936
  0.35720737 -0.40941936  0.36579104]
[ 1.  1.  1.  1. -1.  1. -1.  1.]
[ 1  1  1  1 -1  1 -1  1]


In the above experiment, the thresholded principal eigenvector np.sign(v1) coincides with the ground truth community vector 1-2*x.


Python: Performance of the spectral algorithm We will demonstrate via Python experiments that the principal eigenvector gets closer to the ground truth of the community memberships as $p$ increases. Consider a practical scenario in which $n$ is large, say $n = 4000$. To measure the similarity between the principal eigenvector and the community vector, we employ a well-known correlation measure, called the Pearson correlation (Freedman et al., 2007):

$$
\rho_{X,Y} := \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \tag{3.25}
$$

where $\sigma_{XY} = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$, $\sigma_X = \sqrt{\mathbb{E}[X^2] - (\mathbb{E}[X])^2}$, and $\sigma_Y = \sqrt{\mathbb{E}[Y^2] - (\mathbb{E}[Y])^2}$. To compute the Pearson correlation, we use the built-in function pearsonr defined in the scipy.stats module.
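To connect (3.25) with the library call, the sketch below computes the empirical Pearson correlation directly from sample moments and compares it with `scipy.stats.pearsonr`. The helper name `pearson_from_moments` and the two short example vectors are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_from_moments(a, b):
    """Empirical version of (3.25): sample means replace the expectations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cov = np.mean(a * b) - np.mean(a) * np.mean(b)
    std_a = np.sqrt(np.mean(a ** 2) - np.mean(a) ** 2)
    std_b = np.sqrt(np.mean(b ** 2) - np.mean(b) ** 2)
    return cov / (std_a * std_b)


truth = np.array([1, 1, 1, 1, -1, 1, -1, 1])
estimate = np.array([1, 1, -1, 1, -1, 1, -1, 1])   # one entry flipped (made up)
print(pearson_from_moments(truth, estimate))
print(pearsonr(truth, estimate)[0])
```

A globally flipped estimate gives correlation $-1$ rather than $+1$, which is why the absolute value of the correlation is the relevant quantity when communities are identified only up to a flip.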


from scipy.stats import bernoulli
from scipy.stats import pearsonr
import numpy as np
import matplotlib.pyplot as plt

n = 4000 # number of users
Bern = bernoulli(0.5)
# Generate n community memberships
x = Bern.rvs(n)

# Construct the codebook
X = np.zeros((n,n))
for i in range(len(x)):
    for j in range(i,len(x)):
        # Compute xij = xi + xj (modulo 2)
        X[i,j] = (x[i]+x[j]) % 2
        # Symmetric component
        X[j,i] = X[i,j]

# Sweep p around the limit ln(n)/n
p = np.linspace(0.0003,0.0025,30)
limit = np.log(n)/n
p_norm = p/limit


def power_method(A, eps=1e-5):
    # A computationally efficient algorithm
    # for finding the principal eigenvector
    # Choose a random vector
    v = np.random.randn(A.shape[0])
    # normalization
    v = v/np.linalg.norm(v)
    prev_v = np.zeros(len(v))
    t = 0
    # Iterate until two consecutive iterates are within eps
    while np.linalg.norm(prev_v-v) > eps:
        prev_v = v
        v = np.array(np.dot(A,v)).reshape(-1)
        v = v/np.linalg.norm(v)
        t += 1
    print("Terminated after %s iterations"%t)
    return v


corr = np.zeros_like(p)

for i,val in enumerate(p):
    obs_bern = bernoulli(val)
    # Construct an n-by-n mask matrix:
    # entry = 1 (observed); 0 (otherwise)
    mask_matrix = obs_bern.rvs((n,n))
    # Construct the adjacency matrix
    A = (1-2*X)*mask_matrix
    # Power method
    v1 = power_method(A)
    # Threshold the principal eigenvector
    v1 = np.sign(v1)
    # Compute the ground truth
    ground_truth = 1-2*x
    # Compute Pearson correlation
    corr[i] = np.abs(pearsonr(ground_truth,v1)[0])
    print(p_norm[i], corr[i])

plt.figure(figsize=(5,5), dpi=200)
plt.plot(p_norm, corr)
plt.title('Pearson correlation btw estimate and ground truth')
plt.grid(linestyle=':', linewidth=0.5)
plt.show()


Notice in Fig. 3.11 that the thresholded principal eigenvector gets closer to the ground-truth community vector (or its flipped version) as $p$ increases, reflected in the high Pearson correlation for large $p$. In particular, when $p$ is around the limit $p^* = \frac{\ln n}{n}$, the Pearson correlation is very close to 1, suggesting that the spectral algorithm achieves almost the optimal performance promised by the ML decoding rule. This is a heuristic argument. To make a precise argument, we should rely on the empirical error rate (instead of the Pearson correlation) computed over sufficiently many random realizations of community vectors. For computational simplicity, we instead employ the Pearson correlation, which can be reliably computed with only one random trial per value of $p$.
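For a small $n$, the empirical error rate mentioned above is cheap enough to estimate directly. The sketch below is one possible way to do so and makes several assumptions that differ from the listing above: the observation mask is symmetrized with an empty diagonal, `np.linalg.eigh` stands in for the power method, and the names `spectral_estimate` and `empirical_error_rate`, as well as the values of $n$, $\lambda$ and the number of trials, are hypothetical.

```python
import numpy as np


def spectral_estimate(A):
    """Sign of the principal eigenvector (computed with eigh for brevity)."""
    _, eigvecs = np.linalg.eigh(A)
    return np.where(eigvecs[:, -1] > 0, 1, -1)


def empirical_error_rate(n, p, num_trials, seed=0):
    """Fraction of trials in which the estimate differs from +/- the truth."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(num_trials):
        x = rng.integers(0, 2, size=n)
        s = 1 - 2 * x
        # Symmetric observation mask: each pair i < j is observed w.p. p.
        upper = np.triu(rng.random((n, n)) < p, k=1)
        mask = upper | upper.T
        A = np.outer(s, s) * mask
        est = spectral_estimate(A)
        if not (np.array_equal(est, s) or np.array_equal(est, -s)):
            errors += 1
    return errors / num_trials


n = 200
limit = np.log(n) / n
for lam in (0.5, 2.0, 8.0):
    print(lam, empirical_error_rate(n, lam * limit, num_trials=20))
```

An error is declared only when the estimate matches neither the ground truth nor its flipped version, mirroring the definition of $P_e$.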


Figure 3.11. Pearson correlation between the thresholded principal eigenvector and the ground-truth community vector as a function of $\lambda := \frac{p}{p^*} = \frac{p}{\ln n / n}$.

Look ahead We have delved into one particular data science application of information theory, namely community detection, and highlighted the concept of phase transition, as well as confirming its occurrence through Python simulations. Moving forward, the next section will focus on a closely related application to community detection in the field of computational biology, known as Haplotype phasing. We will explore the nature of this problem and how it is connected to community detection.

Problem Set 7


Prob 7.1 (Basics on bounds and combinatorics)


(a) Let $p \ge 0$. Prove that

$$
1 - p \le e^{-p}.
$$

Also specify the condition under which the equality holds.
(b) Show that for non-negative integers $n$ and $k$ ($k \le n$):

$$
\binom{n}{k} \le n^k.
$$

(c) Show that for integers $n \ge 0$:

$$
\sum_{k=0}^{n} \binom{n}{k} = 2^n.
$$

Prob 7.2 (The concept of reliable community detection) Suppose there are $n$ users clustered into two communities. Let $x_i \in \{0,1\}$ indicate a membership of community with regard to user $i \in \{1, 2, \ldots, n\}$. We are given part of the pairwise measurements:

$$
y_{ij} = \begin{cases} x_i \oplus x_j, & \text{w.p. } p; \\ e, & \text{w.p. } 1-p, \end{cases}
$$

for every pair $(i,j) \in \{(1,2), (1,3), \ldots, (1,n), (2,3), \ldots, (n-1,n)\}$ and $p \in [0,1]$. Assume that the $y_{ij}$'s are independent over $(i,j)$. Given the $y_{ij}$'s, one wishes to decode the community membership vector $x := [x_1, x_2, \ldots, x_n]$ or its flipped counterpart $x \oplus \mathbf{1} := [x_1 \oplus 1, x_2 \oplus 1, \ldots, x_n \oplus 1]$. Let $\hat{x}$ be an estimate. Define the probability of error as:

$$
P_e = P\left(\hat{x} \notin \{x, x \oplus \mathbf{1}\}\right).
$$

(a) Let sample complexity be the number of pairwise measurements which are not erased. Show that

$$
\frac{\text{sample complexity}}{\binom{n}{2}} \to p \quad \text{as } n \to \infty.
$$

(b) Consider the following optimization problem. Given $p$ and $n$,

$$
P_e^*(p, n) := \min_{\text{algorithm}} P_e.
$$

State the definition of reliable detection. Also state the definition of minimum sample complexity using the concept of reliable detection.
(c) Consider a slightly different optimization problem. Given $p$,

$$
P_e^*(p) := \min_{\text{algorithm},\, n} P_e.
$$

The distinction here is that $n$ is a design parameter. For $\epsilon > 0$, what are $P_e^*\left(\frac{\ln n}{n} + \epsilon\right)$ and $P_e^*\left(\frac{\ln n}{n} - \epsilon\right)$? Also explain why.


Prob 7.3 (An upper bound) Let $p = \lambda \frac{\ln n}{n}$ where $\lambda > 1$. Show that

$$
\sum_{k=1}^{n/2} \binom{n}{k} (1-p)^{k(n-k)} \to 0 \quad \text{as } n \to \infty.
$$

Hint: Use the bounds in Prob 7.1.


Prob 7.4 (Applying Fano's inequality to community detection) Consider an instance of community detection in which the goal is to figure out the community membership of each user between community 0 and community 1. Let $x = [x_1, x_2, \ldots, x_n]$ be a collection of community memberships of $n$ users: $x_i \in \{0,1\}$, $1 \le i \le n$. We are given part of pairwise measurements. With probability $p$, we observe $x_{ij} := x_i \oplus x_j$ independently over all pairs $(i,j)$ where $i < j$:

$$
y_{ij} = \begin{cases} x_{ij}, & \text{w.p. } p; \\ e, & \text{otherwise}. \end{cases}
$$

Let $X = [x_{ij}]$ and $Y = [y_{ij}]$.


(a) Suppose $n = 3$. Compute $H(X)$ and $H(Y)$.
(b) Consider the following upper bound:

$$
H(Y) \le \sum_{i<j} H(y_{ij}).
$$

Is this bound tight? If not, derive a tighter upper bound.
(c) Using Fano’s inequality and data processing inequality, derive the following 
necessary condition for reliable detection: 


Does the result in part (b) lead to a tighter necessary condition? If so, derive the necessary condition.

Figure 3.12. Illustration of graph connectivity and disconnectivity (left: connected; right: disconnected).


Prob 7.5 (Erdős-Rényi random graph) Consider a random graph $G$ that two mathematicians (named Paul Erdős and Alfréd Rényi) introduced in (Erdős et al., 1960). Hence, the graph is called the Erdős-Rényi graph. The graph contains $n$ nodes and assumes that an edge appears w.p. $p \in [0,1]$ for any pair of two nodes in an independent manner. We say that a graph is connected if for any two nodes, there exists a path (i.e., a sequence of edges) that connects one node to the other; otherwise, it is said to be disconnected. See Fig. 3.12 for examples.


(a) Show that

$$
P(G \text{ is disconnected}) \le \sum_{k=1}^{n-1} \binom{n}{k} (1-p)^{k(n-k)}.
$$

Hint: Think about a necessary event for disconnectivity.

(b) Show that if $p > \frac{\ln n}{n}$, $P(G \text{ is disconnected}) \to 0$ as $n \to \infty$.
Hint: Recall the achievability proof for community detection that we did in Section 3.2.

(c) It has been shown that if $p < \frac{\ln n}{n}$, $P(G \text{ is disconnected}) \to 1$ as $n \to \infty$. This together with the result in part (a) implies that the sharp threshold on $p$ for graph connectivity is the same as the one for community detection:

$$
p^* = \frac{\ln n}{n}.
$$

Relate this to the fundamental limit on observation probability in community detection, i.e., explain why the limits are the same.


Prob 7.6 (The coupon collector problem) There are $n$ different coupons. Suppose that for any cracker, the probability that the cracker contains a particular coupon among the $n$ coupons is $\frac{1}{n}$, i.e., the $n$ kinds of coupons are uniformly distributed over the entire crackers that are being sold.

(a) Suppose that Alice has $k$ ($< n$) distinct coupons. When Alice buys a new cracker, what is the probability that the new cracker contains a new coupon (i.e., being different from the $k$ coupons that Alice possesses)? Let $X_k$ be the number of crackers that Alice needs to buy to acquire a new coupon. Show that

$$
P(X_k = m) = \left(\frac{k}{n}\right)^{m-1} \frac{n-k}{n}.
$$

(b) Suppose Bob has no coupon. Let $K := \sum_{k=0}^{n-1} X_k$ indicate the number of crackers that Bob needs to buy to collect all the coupons. Show that

$$
\mathbb{E}[K] = n \sum_{k=1}^{n} \frac{1}{k}.
$$

(c) Using the fact that $\sum_{k=1}^{n} \frac{1}{k} \approx \ln n$ in the limit of $n$ (check the Euler-Maclaurin formula in Wikipedia), argue that in the limit of $n$,

$$
\mathbb{E}[K] \approx n \ln n. \tag{3.26}
$$

Note: This is order-wise the same as the minimal sample complexity $\frac{n \ln n}{2}$ required for reliable community detection.

(d) Using Python, plot $\mathbb{E}[K]$ and $n \ln n$ in the same figure for a proper range of $n$, say $1 \le n \le 10{,}000$.


Prob 7.7 (Converse for community detection) Suppose there are $n$ users
clustered into two communities. Let $x_i \in \{0,1\}$ indicate the community
membership of user $i \in \{1,2,\ldots,n\}$. Let $X$ be a codeword matrix whose
$(i,j)$-th entry is $x_{ij} = x_i \oplus x_j$. Assume that we are given part of the comparison
pairs:

$$y_{ij} = \begin{cases} x_{ij}, & \text{w.p. } p; \\ e, & \text{o.w.} \end{cases}$$

for $(i,j) \in \{(1,2),(1,3),\ldots,(n-1,n)\}$ and $p \in [0,1]$. We also assume that
$y_{ij}$'s are independent over $(i,j)$. Let $Y$ be a received signal matrix whose $(i,j)$-th
entry is $y_{ij}$. Given $Y$, one wishes to decode the community membership vector
$x = [x_1, x_2, \ldots, x_n]$ or $x \oplus 1 := [x_1 \oplus 1, x_2 \oplus 1, \ldots, x_n \oplus 1]$. Let $\hat{x}$ be an estimate.
Define the probability of error as

$$P_e = P(\hat{x} \notin \{x, x \oplus 1\}).$$

(a) Show that

$$H(x|\hat{x}) \leq 1 + nP_e.$$

(b) Show that

$$I(x;\hat{x}) \leq I(X;Y).$$

(c) Assume $x_i$'s are i.i.d. $\sim$ Bern($\frac{1}{2}$). Using parts (a) and (b), derive a necessary
condition on $p$ under which $P_e$ can be made arbitrarily close to 0 as $n \to \infty$.


Prob 7.8 (True or False?)

(a) Consider an instance of community detection with two communities. Let
$x = [x_1, x_2, \ldots, x_n]$ be the community membership vector in which
$x_i \in \{0, 1\}$ and $n$ denotes the total number of users. Suppose we are given
part of the comparison pairs with observation probability $p$. In Section 3.1,
we formulated an optimization problem which aims to minimize the probability
of error defined as $P_e := P(\hat{x} \notin \{x, x \oplus 1\})$. Given $p$ and $n$, denote
by $P_e^*(p, n)$ the minimum probability of error. In Section 3.2, we did not
intend to derive the exact $P_e^*(p, n)$. Instead we developed a lower bound of
$P_e^*(p, n)$ to demonstrate that for any $p > \frac{\ln n}{n}$, the probability of error can
be made arbitrarily close to 0 as $n$ tends to infinity.

(b) Consider an inference problem in which we wish to decode $X \in \mathcal{X}$ from
$Y \in \mathcal{Y}$ where $\mathcal{X}$ and $\mathcal{Y}$ indicate the ranges of $X$ and $Y$, respectively. Given
$Y = y$, the optimal decoder is:

$$\hat{X} = \arg\max_{x \in \mathcal{X}} P(Y = y | X = x).$$

3.5 DNA Sequencing: Fundamental Limits


Recap In the previous sections, we proved both the achievability and converse of
the fundamental limit on the observation probability $p$ needed to achieve reliable
community detection:

$$p > \frac{\ln n}{n} \iff P_e \to 0 \text{ as } n \to \infty.$$

Via Python simulation, we also observed a phase transition at the limit $\frac{\ln n}{n}$. The
estimated vector via the spectral algorithm indeed converges to the ground-truth
community vector when $p$ is close to $\frac{\ln n}{n}$.

Now we will move on to another data science application concerning phase
transition. The application that we will focus on is in computational biology.
Specifically we will explore one of the important DNA sequencing problems,
named Haplotype phasing (Browning and Browning, 2011; Das and Vikalo, 2015;
Chen et al., 2016a; Si et al., 2014). Interestingly, it has a close connection with
community detection.


Outline In this section, we will examine Haplotype phasing and explore its rela- 
tionship to community detection. The section is divided into four parts. First, we 
will investigate two relevant keywords: DNA sequencing and Haplotype. We will 
then figure out what Haplotype phasing is. Next, drawing upon computational 
biology expertise, we will establish a link to community detection. Finally, we will 
examine the sharp threshold present in this problem, as in community detection. 


Our 23 pairs of chromosomes We will begin by discussing two important
terms: (1) DNA sequence and (2) Haplotype. Our body consists of numerous cells, 
and each cell has a vital component known as the nucleus. The nucleus contains 
23 pairs of chromosomes. In each pair, one comes from the mother (maternal chro- 
mosome), while the other comes from the father (paternal chromosome). Fig. 3.13 
illustrates the 23 pairs of chromosomes. Each chromosome in a pair consists of a 
series of elements known as bases, and each base can take on one of four letters: A, 
C, T, or G. This series of bases is referred to as a DNA sequence, with a typical 
length of around 3 billion. 

Looking inside a pair of chromosomes, we see an interesting sequence pattern.
The maternal sequence is almost identical to its paternal counterpart: about 99.9%
of the bases coincide. See Fig. 3.14. Differences occur only at certain positions. Such
differences are called “Single Nucleotide Polymorphisms (SNPs)”, pronounced
“snips”. It is well known that knowing the SNP patterns is useful for personalized
medicine. It can help predict the probability of a certain cancer occurring. It
determines somatic mutations such as HIV. It also serves to understand phylogenetic
trees, exhibiting relationships between a variety of distinct species. The second
keyword “Haplotype” refers to a pair of the two sequences of SNPs.


Figure 3.13. 23 pairs of chromosomes. Source: epigenetics tutorial of Umich BioSocial Collaborative.


Figure 3.14. The maternal sequence is almost (99.9%) identical to the paternal sequence.


Haplotype phasing Haplotype phasing is the process of identifying the pair of
SNP sequences, which involves two sub-tasks: (1) identifying the locations of the SNPs,
and (2) decoding the sequence pattern. The locations of SNPs are typically determined
using “SNP calling” (Nielsen et al., 2011). Therefore, Haplotype phasing
usually refers to the second task: identifying the pattern of the SNPs. Each element
in the pattern takes a value from the set of four letters {A, C, T, G}. You may be
wondering how this is related to the community detection problem, where each
component in the community membership vector is binary.

Figure 3.15. Major vs. minor allele.


Binary representation of major and minor allele In fact, a key property 
of the values taken by each base in the SNPs enables a concrete connection with 
the community detection problem. Specifically, there are two types of base compo- 
nents: (1) major allele; and (2) minor allele. The major allele is the base that occurs 
for a majority of human beings, while the minor allele is the one that occurs for a 
minor portion of humans, as shown in Fig. 3.15. Note that in the example, the first 
SNP reads A for a majority of humans, representing the major allele (denoted by 0), 
while Human 3’s SNP in that position reads T, which is a rare occurrence and is 
classified as a minor allele (denoted by 1). Any letter except the one associated with 
the major allele is considered a minor allele, such as T, C, and G in this example. 
Because each SNP can be categorized into only two types, we can represent it as a 
binary value, which in turn establishes a connection to the community detection 
problem. 


Two types of SNP positions Have we made the connection between the two 
problems? Not quite yet. Although there are similarities between Haplotype phas- 
ing and community detection, there is still a fundamental difference: Haplotype 
phasing involves decoding two sequences (vectors), whereas community detection 
involves decoding only one vector. However, it is possible to cast the Haplotype 
phasing problem into a problem of decoding only one vector (up to a global shift). 
To understand this, we need to consider another property of SNP positions: each 
position is of only two types — “heterozygous” or “homozygous”. A heterozygous 
position refers to a position where the maternal base is a complement of the pater- 
nal base, while a homozygous position refers to a position where the maternal and 
paternal bases are the same. Using a standard method, one can determine the type 
of all SNP positions. Assuming that all SNP positions are heterozygous simplifies 
the problem and makes it identical to the community detection problem. Let $x$ be
the sequence of SNPs from the mother. Then, the father's sequence would be its flipped
version $x \oplus 1$. The goal of Haplotype phasing is to decode $x$ or $x \oplus 1$.
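
To make the binary representation concrete, here is a minimal Python sketch. The list of major alleles and the maternal bases are hypothetical toy data, and the helper names `to_binary` and `complement` are our own choices for illustration. The sketch maps each SNP base to 0 (major allele) or 1 (minor allele) and, under the all-heterozygous assumption, obtains the father's sequence as the flipped version.

```python
# Hypothetical toy data: the major allele at each SNP position
# (an assumption of this sketch, not real genomic data).
MAJOR_ALLELE = ["A", "C", "G", "A", "T"]


def to_binary(snp_bases, major_allele=MAJOR_ALLELE):
    """Map each SNP base to 0 (major allele) or 1 (any minor allele)."""
    return [0 if base == major else 1
            for base, major in zip(snp_bases, major_allele)]


def complement(x):
    """Return x XOR 1, i.e., flip every bit of x."""
    return [bit ^ 1 for bit in x]


maternal_bases = ["A", "T", "G", "C", "T"]  # hypothetical maternal SNP bases
x = to_binary(maternal_bases)               # maternal sequence x
x_father = complement(x)                    # paternal sequence if all SNPs are heterozygous

print(x)         # [0, 1, 0, 1, 0]
print(x_father)  # [1, 0, 1, 0, 1]
```

Flipping twice recovers the original sequence, which reflects why decoding $x$ and decoding $x \oplus 1$ carry the same information.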


Mate-pair read (Browning and Browning, 2011; Das and Vikalo, 2015) 
To establish the connection, we need to examine the type of information we have 
access to for Haplotype phasing. The information we can access is related to the cur- 
rent sequencing technology, which relies on a technique called “shotgun sequenc- 
ing”. This technique yields short fragments of the entire DNA sequence, known 
as “reads”. The length of a typical read is between 100 and 500 bases, while SNPs 
consist of around 3 million bases, and the entire DNA sequence is around 3 billion 
bases in length. As a result, the average distance between SNPs is approximately 
1000 bases, but the read length is much shorter than this distance. This implies 
that one read (spanning 100 ~ 200 bases) usually contains only one SNP. This 
presents a challenge: we do not know which chromosome each read comes from 
(either maternal or paternal). To see this, let $y_i$ denote the $i$th SNP contained in a
read. Then, what we obtain is:

$$y_i = \begin{cases} x_i \ (\text{mother's}), & \text{w.p. } \frac{1}{2}; \\ x_i \oplus 1 \ (\text{father's}), & \text{w.p. } \frac{1}{2}. \end{cases}$$

Since the probabilities of getting $x_i$ and $x_i \oplus 1$ are both equal to $\frac{1}{2}$, there is no way to
figure out $x_i$ from $y_i$.

To overcome this challenge, a more advanced sequencing technique has been 
developed which allows for simultaneous reading of two fragments, known as 
“mate-pair reads”. As shown in Fig. 3.16, mate-pair reads have proved to be useful. 


One advantage of this sequencing technology is that both reads come from the same
chromosome (either maternal or paternal), which means that the information we obtain is:

$$(y_i, y_j) = \begin{cases} (x_i, x_j) \ (\text{mother's}), & \text{w.p. } \frac{1}{2}; \\ (x_i \oplus 1, x_j \oplus 1) \ (\text{father's}), & \text{w.p. } \frac{1}{2}. \end{cases}$$


Figure 3.16. Mate-pair read. (Figure labels: Ch1 (M), Ch1 (F), 1000 bases (on average), “mate-pair read”.)

Connection to community detection What we know for sure from $(y_i, y_j)$ is
their parity: $y_i \oplus y_j = x_i \oplus x_j$. Note that this is exactly the pairwise measurement
given for community detection. In reality, however, what we get is a noisy version
of $(y_i, y_j)$:

$$(y_i, y_j) = \begin{cases} (x_i \oplus z_i, x_j \oplus z_j) \ (\text{mother's}), & \text{w.p. } \frac{1}{2}; \\ (x_i \oplus 1 \oplus z_i, x_j \oplus 1 \oplus z_j) \ (\text{father's}), & \text{w.p. } \frac{1}{2} \end{cases}$$

where $z_i$ indicates an additive noise induced at the $i$th SNP. Extensive experiments
suggest that $z_i$'s can be modeled as i.i.d., each distributed according to, say,
Bern($q$) (meaning that the noise statistics are identical and independent across all
SNPs). The parity is

$$y_{ij} := y_i \oplus y_j = x_i \oplus x_j \oplus z_i \oplus z_j.$$

Let $z_{ij} := z_i \oplus z_j$. Then, its statistics would be $z_{ij} \sim$ Bern($2q(1-q)$). Why? Using
the total probability law, we get:

$$\begin{aligned}
P(z_{ij} = 1) &= P(z_i = 1)P(z_j = 0 | z_i = 1) + P(z_i = 0)P(z_j = 1 | z_i = 0) \\
&= q(1-q) + (1-q)q \\
&= 2q(1-q)
\end{aligned}$$

where the second equality is due to the independence of $z_i$ and $z_j$. Denoting $\theta =
2q(1-q)$, we see that the measurement $y_{ij}$ is a noisy version of $x_{ij}$. One may wonder
if looking at the parity only (instead of individual measurements $(y_i, y_j)$) suffices to
decode $x$. In other words, is $y_{ij}$ a sufficient statistic? It is indeed the case. Check
in Prob 8.5(c). This suggests that we do not lose any information although we
consider only parities.
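
The claim that the parity noise follows Bern($2q(1-q)$) can be checked numerically. The sketch below is only an illustration: the value $q = 0.1$, the number of trials, and the function names are our own assumptions. It compares the exact formula with a Monte Carlo estimate obtained by XORing two independent Bern($q$) noise bits.

```python
import random


def parity_noise_rate(q):
    """Exact P(z_i XOR z_j = 1) for independent z_i, z_j ~ Bern(q)."""
    return q * (1 - q) + (1 - q) * q


def simulate_parity_noise(q, trials=200_000, seed=0):
    """Monte Carlo estimate of P(z_i XOR z_j = 1)."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(trials):
        z_i = rng.random() < q   # noise bit at the ith SNP
        z_j = rng.random() < q   # noise bit at the jth SNP
        ones += z_i ^ z_j        # parity noise z_ij
    return ones / trials


q = 0.1  # hypothetical per-SNP error rate
theta = parity_noise_rate(q)
estimate = simulate_parity_noise(q)
print(theta, estimate)  # the two values should be close
```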


Translation into a communication problem The connection above was
made in (Chen et al., 2016a). The authors applied this connection to a partial and
random measurement model where the parity is observed with probability $p$,
independently from others:

$$y_{ij} = \begin{cases} x_i \oplus x_j \oplus z_{ij}, & \text{w.p. } p; \\ e, & \text{w.p. } 1-p \end{cases}$$

where $z_{ij} \sim$ Bern($\theta$) and $\theta \in (0, \frac{1}{2})$. Without loss of generality, assume that
$0 < \theta < \frac{1}{2}$; otherwise, one can flip all 0's into 1's and 1's into 0's. Since we wish
to infer $x$ from $y_{ij}$'s (an inference problem), this problem can be interpreted as a
communication problem illustrated in Fig. 3.17.
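
The measurement model above can be simulated directly. The following is a minimal sketch, assuming (as the model does) that the noise bits are drawn independently across pairs; the function name `observe_parities`, the erasure symbol, and the toy ground-truth vector are hypothetical choices made for illustration.

```python
import random

ERASURE = "e"  # symbol used here for an unobserved pair


def observe_parities(x, p, theta, seed=0):
    """Return {(i, j): y_ij} over all pairs i < j under the model above."""
    rng = random.Random(seed)
    n = len(x)
    y = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:                   # pair observed w.p. p
                z_ij = int(rng.random() < theta)   # parity flipped w.p. theta
                y[(i, j)] = x[i] ^ x[j] ^ z_ij
            else:                                  # pair erased w.p. 1 - p
                y[(i, j)] = ERASURE
    return y


x = [0, 1, 1, 0, 1, 0]  # hypothetical ground-truth vector
y = observe_parities(x, p=0.7, theta=0.1)
print(y)
```

Setting `p=1.0` and `theta=0.0` recovers the full noiseless codeword, which is a convenient sanity check.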

Figure 3.17. Translation of Haplotype phasing into a communication problem under a
noisy channel with partial observations.


Figure 3.18. The effect of noise upon the limit.


The fundamental limit (Chen et al., 2016a) The above model subsumes the
noiseless scenario $\theta = 0$ as a special case. One can easily expect that the larger $\theta$,
the larger the limit $p^*$. It is shown that the fundamental tradeoff behaves like:

$$p^* = \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}} \tag{3.27}$$

where $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence defined w.r.t. the natural
logarithm:

$$\begin{aligned}
\mathrm{KL}(0.5\|\theta) &= 0.5 \ln \frac{0.5}{\theta} + 0.5 \ln \frac{0.5}{1-\theta} \\
&= 0.5 \ln \frac{1}{4\theta(1-\theta)}.
\end{aligned}$$

Plugging this into (3.27), we get:

$$p^* = \frac{\ln n}{n} \cdot \frac{1}{1 - \sqrt{4\theta(1-\theta)}}. \tag{3.28}$$

Indeed the limit is an increasing function of $\theta$. It grows sharply with $\theta$, diverging
as $\theta \to \frac{1}{2}$. See Fig. 3.18.

The only distinction in the noisy setting relative to the noiseless counterpart
is that we have a factor $\frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}}$ in (3.27), reflecting the noise effect. In the
noiseless setting $\theta = 0$, $\mathrm{KL}(0.5\|\theta) = \infty$, so it reduces to the limit $\frac{\ln n}{n}$.
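
To get a feel for how the noise factor inflates the limit, the following sketch evaluates the threshold in (3.27) for a few noise levels. The function names and the sample values of $n$ and $\theta$ are our own assumptions; the formula itself is the one stated above, valid for $0 \leq \theta < \frac{1}{2}$.

```python
import math


def kl_half_vs_theta(theta):
    """KL(0.5 || theta) in nats; infinite in the noiseless case theta = 0."""
    if theta == 0:
        return math.inf
    return 0.5 * math.log(1.0 / (4.0 * theta * (1.0 - theta)))


def p_star(n, theta):
    """Threshold (ln n / n) * 1 / (1 - exp(-KL(0.5 || theta))), for 0 <= theta < 0.5."""
    noise_factor = 1.0 / (1.0 - math.exp(-kl_half_vs_theta(theta)))
    return math.log(n) / n * noise_factor


n = 1000  # hypothetical number of SNPs
for theta in (0.0, 0.05, 0.1, 0.2, 0.3):
    print(theta, p_star(n, theta))
```

For $\theta = 0$ the noise factor equals 1, and the value coincides with the noiseless limit $\frac{\ln n}{n}$.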


Look ahead We related community detection to one of the applications in
computational biology: Haplotype phasing. We showed that Haplotype phasing is a noisy
version of the community detection problem, and claimed that a phase transition
occurs on the observation probability: $p^* = \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}}$. In the next section,
we will prove the achievability of the claimed limit.




3.6 DNA Sequencing: Achievability Proof 


Recap In the previous section, we made a connection between Haplotype phasing
and community detection. We showed that Haplotype phasing is a noisy version
of community detection, wherein the goal is to decode $x = [x_1, \ldots, x_n]$ from $y_{ij}$'s:

$$y_{ij} = \begin{cases} x_i \oplus x_j \oplus z_{ij}, & \text{w.p. } p; \\ e, & \text{w.p. } 1-p \end{cases}$$

where $z_{ij} \sim$ Bern($\theta$) and $\theta \in (0, \frac{1}{2})$. We then claimed that as in community
detection, a phase transition occurs on the observation probability:

$$p^* = \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}}$$

where $\mathrm{KL}(0.5\|\theta)$ denotes the KL divergence between Bern(0.5) and Bern($\theta$)
defined w.r.t. the natural logarithm:

$$\mathrm{KL}(0.5\|\theta) := 0.5 \ln \frac{0.5}{\theta} + 0.5 \ln \frac{0.5}{1-\theta} = 0.5 \ln \frac{1}{4\theta(1-\theta)}.$$

Outline In this section, we will demonstrate that the limit is achievable, and we 
will do so in three steps. First, we will derive the optimal ML decoder. Next, we 
will analyze the error probability. Although the overall procedure of the proof is 
similar to that in the noiseless case, there are a few key differences that require the 
use of important bounding techniques, which we will detail. Finally, by applying 
these bounding techniques to the error probability, we will prove the achievability. 


The optimal ML decoder As in the noiseless case, we employ the same
deterministic encoder which yields a codeword matrix $X(x)$ with the $(i,j)$ entry $x_{ij} =
x_i \oplus x_j$. A distinction arises on the decoder side. The optimal decoder takes the ML
decision rule. But the ML decoder is not the one based solely on the concept of
compatibility vs. incompatibility which formed the basis of the noiseless case. To
see this, consider the following example in which $n = 4$, the ground-truth vector
$x = 0$, and an observation matrix $Y$ reads:

$$Y = \begin{bmatrix} 1 & e & e \\ & 0 & 0 \\ & & 1 \end{bmatrix} \tag{3.29}$$

where the rows list $(y_{12}, y_{13}, y_{14})$, $(y_{23}, y_{24})$ and $y_{34}$, respectively.


A key observation is that not all the observed components match the corresponding
$x_{ij}$'s. In this example, noises are added to the (1, 2) and (3, 4) entries, yielding
$(y_{12}, y_{34}) = (1, 1)$. This makes the calculation of $P(Y|X(x))$ a bit more involved,
relative to the noiseless case. The likelihood function is not solely determined by
erasure patterns but is also influenced by flipping error patterns, which creates a
distinction between the noiseless and noisy cases. In the absence of noise, the
likelihood takes on either a value of 0 or a specific non-zero value. However, in the
presence of noise, the likelihood can take on multiple non-zero values, depending
on the flipping error patterns. To illustrate this, consider the following examples.
Given $Y$ in (3.29), the likelihoods are:


$$\begin{aligned}
P(Y|X(0000)) &= (1-p)^2 p^4 \cdot \theta^{2}(1-\theta)^{2}; \\
P(Y|X(1000)) &= (1-p)^2 p^4 \cdot \theta^{1}(1-\theta)^{3}; \\
P(Y|X(0100)) &= (1-p)^2 p^4 \cdot \theta^{3}(1-\theta)^{1}
\end{aligned}$$

where

$$X(1000) = \begin{bmatrix} 1 & 1 & 1 \\ & 0 & 0 \\ & & 0 \end{bmatrix}, \qquad X(0100) = \begin{bmatrix} 1 & 0 & 0 \\ & 1 & 1 \\ & & 0 \end{bmatrix}.$$

In all likelihoods, the first product term $(1-p)^2 p^4$ is common. So the second term
associated with a flipping error pattern will decide the ML solution. The exponents
of $\theta$ indicate the numbers of flips. Since the flipping probability
$\theta$ is assumed to be less than $\frac{1}{2}$, the smallest number of flips would maximize the
likelihood, thus yielding:

$$\hat{x}_{\mathrm{ML}} = \arg\min_{x} d(X(x), Y) \tag{3.30}$$

where $d(\cdot,\cdot)$ denotes the Hamming distance: the number of distinct bits between
the two arguments (Hamming, 1950).
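
The rule (3.30) can be tried out by brute force for small $n$. The sketch below is an illustration under our own assumptions: the observation `y` is a hypothetical example with $n = 4$ (pairs are 0-indexed), erased entries are skipped when counting disagreements, and the function names are ours. Exhaustive search over all $2^n$ candidates is feasible only for very small $n$.

```python
from itertools import product

ERASURE = "e"  # symbol used here for an unobserved pair


def codeword(x):
    """Codeword X(x): entries x_i XOR x_j for all pairs i < j."""
    n = len(x)
    return {(i, j): x[i] ^ x[j] for i in range(n) for j in range(i + 1, n)}


def hamming_distance(x, y):
    """Number of observed entries of y that disagree with X(x)."""
    cw = codeword(x)
    return sum(1 for pair, value in y.items()
               if value != ERASURE and value != cw[pair])


def ml_decode(y, n):
    """Exhaustive search for a minimizer of the Hamming distance."""
    return min(product((0, 1), repeat=n), key=lambda x: hamming_distance(x, y))


# Hypothetical observation for n = 4: four observed parities, two erasures.
y = {(0, 1): 1, (0, 2): ERASURE, (0, 3): ERASURE,
     (1, 2): 0, (1, 3): 0, (2, 3): 1}

for candidate in [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0)]:
    print(candidate, hamming_distance(candidate, y))  # distances 2, 1, 3

x_hat = ml_decode(y, 4)
print(x_hat, hamming_distance(x_hat, y))
```

For this particular observation no candidate has distance 0, and several candidates tie at distance 1, so the minimizer returned depends on the search order. Note also that the all-zero vector is not among the minimizers here.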


A setup for the analysis of the error probability For the achievability proof,
we analyze the probability of error. Taking the same procedures as in the noiseless
case, we obtain:

$$\begin{aligned}
P_e &:= P(\hat{x}_{\mathrm{ML}} \notin \{x, x \oplus 1\}) \\
&= P(\hat{x}_{\mathrm{ML}} \notin \{0, 1\} | x = 0) \\
&\leq \sum_{a \notin \{0,1\}} P(\hat{x}_{\mathrm{ML}} = a | x = 0) \\
&= \sum_{k=1}^{n-1} \sum_{a \in A_k} P(\hat{x}_{\mathrm{ML}} = a | x = 0) \\
&= \sum_{k=1}^{n-1} \binom{n}{k} P(\hat{x}_{\mathrm{ML}} = a_k | x = 0)
\end{aligned} \tag{3.31}$$

where the first equality is by symmetry (each error probability does not depend
on the input vector pattern); and the inequality comes from the union bound. In
the second last equality, we take an equivalent yet more insightful expression, by
introducing a set $A_k := \{a : \|a\|_1 = k\}$.


Figure 3.19. Illustration of the $k(n-k)$ distinguishable positions colored in purple. Unlike the
noiseless case, all that matters in a decision is the Hamming distance. So the error event
implies that the number of 1's is greater than or equal to the number of 0's in the $k(n-k)$
distinguishable positions.


A bound on $P(\hat{x}_{\mathrm{ML}} = a_k | x = 0)$ Focus on $P(\hat{x}_{\mathrm{ML}} = a_k | x = 0)$. Unlike the
noiseless case, deriving its upper bound is not that straightforward. To see this,
consider a case in which $a_k = (1 \cdots 1 0 \cdots 0) \in A_k$. Notice that $X(a_k)$ takes 1's
only in the last $n - k$ positions of the first $k$ rows (that we called distinguishable
positions in light of $X(0)$). See the middle in Fig. 3.19.

In the noiseless case, the error event $\{\hat{x}_{\mathrm{ML}} = a_k | x = 0\}$ must imply that such
positions are all erased, since otherwise $X(a_k)$ is incompatible. In the noisy case, on
the other hand, the error event does not necessarily imply that the distinguishable
positions are all erased, since $X(a_k)$ could still be a candidate for the solution even if
a few observations are made in such positions. As indicated in (3.30), what matters
in a decision is the Hamming distance. As long as $d(X(a_k), Y) \leq d(X(0), Y)$, the
codeword $X(a_k)$ can be chosen as a solution, no matter what the number of erasures
is in those positions. Hence, the error event only suggests that the number of 1's is
greater than or equal to the number of 0's in the $k(n-k)$ distinguishable positions.
Let $M$ be the number of observations made in the distinguishable positions. Then,
we get: for $a_k \in A_k$,

$$\begin{aligned}
P(\hat{x}_{\mathrm{ML}} = a_k | x = 0)
&\leq P(d(X(a_k), Y) \leq d(X(0), Y) | x = 0) \\
&= P(\#1\text{'s} \geq \#0\text{'s in the distinguishable positions} | x = 0) \\
&\overset{(a)}{=} \sum_{\ell=0}^{k(n-k)} P(M = \ell | x = 0) \\
&\qquad \times P(\#1\text{'s} \geq \#0\text{'s in distinguishable positions} | x = 0, M = \ell)
\end{aligned} \tag{3.32}$$

where (a) is due to the total probability law. Here $P(M = \ell | x = 0) =
\binom{k(n-k)}{\ell} p^{\ell} (1-p)^{k(n-k)-\ell}$.

Consider $P(\#1\text{'s} \geq \#0\text{'s in distinguishable positions} | x = 0, M = \ell)$. Let $Z_i$ be
the measured value at the $i$th observed entry in the $k(n-k)$ distinguishable positions.
Then, $Z_i$'s are i.i.d. $\sim$ Bern($\theta$) where $i \in \{1, 2, \ldots, \ell\}$. Using this, we get:

$$P(\#1\text{'s} \geq \#0\text{'s in distinguishable positions} | x = 0, M = \ell)
= P\left(\frac{Z_1 + Z_2 + \cdots + Z_\ell}{\ell} \geq 0.5\right). \tag{3.33}$$

Since $Z_i$'s are i.i.d. $\sim$ Bern($\theta$), one may expect that the empirical mean of $Z_i$'s
would be concentrated around the true mean $\theta$ as $\ell$ increases. It is indeed the case
and it can be proved via the WLLN. Also by our assumption, $\theta < 0.5$. Hence, the
probability $P\left(\frac{Z_1 + Z_2 + \cdots + Z_\ell}{\ell} \geq 0.5\right)$ would converge to zero, as $\ell$ tends to infinity.
What we are interested in here is how fast the probability converges to zero. There is
a very well-known concentration bound which characterizes the convergence behavior
of the probability. That is the Chernoff bound (Bertsekas and Tsitsiklis, 2008;
Gallager, 2013), formally stated below:

$$P\left(\frac{Z_1 + Z_2 + \cdots + Z_\ell}{\ell} \geq 0.5\right) \leq e^{-\ell\, \mathrm{KL}(0.5\|\theta)}. \tag{3.34}$$

Check Prob 4.4 for the proof.
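
The Chernoff bound (3.34) can be compared against the exact binomial tail for small $\ell$. The sketch below is only a numerical illustration; the value $\theta = 0.1$, the range of $\ell$, and the function names are our own assumptions.

```python
import math


def kl_half_vs_theta(theta):
    """KL(0.5 || theta) in nats, for 0 < theta < 1."""
    return 0.5 * math.log(1.0 / (4.0 * theta * (1.0 - theta)))


def majority_flip_probability(ell, theta):
    """Exact P(Z_1 + ... + Z_ell >= ell / 2) for Z_i i.i.d. ~ Bern(theta)."""
    threshold = math.ceil(ell / 2)  # smallest integer count that is >= ell / 2
    return sum(math.comb(ell, s) * theta**s * (1 - theta)**(ell - s)
               for s in range(threshold, ell + 1))


def chernoff_bound(ell, theta):
    """Right-hand side of (3.34): exp(-ell * KL(0.5 || theta))."""
    return math.exp(-ell * kl_half_vs_theta(theta))


theta = 0.1  # hypothetical flipping probability
for ell in (1, 5, 10, 20, 40):
    print(ell, majority_flip_probability(ell, theta), chernoff_bound(ell, theta))
```

In each printed row the exact probability should not exceed the bound, and both decay as $\ell$ grows.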
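Applying this to (3.32), we get:

$$\begin{aligned}
P(\hat{x}_{\mathrm{ML}} = a_k | x = 0)
&\leq \sum_{\ell=0}^{k(n-k)} P(M = \ell | x = 0) \\
&\qquad \times P(\#1\text{'s} \geq \#0\text{'s in distinguishable positions} | x = 0, M = \ell) \\
&\overset{(a)}{\leq} \sum_{\ell=0}^{k(n-k)} \binom{k(n-k)}{\ell} p^{\ell} (1-p)^{k(n-k)-\ell} e^{-\ell\, \mathrm{KL}(0.5\|\theta)} \\
&= (1-p)^{k(n-k)} \sum_{\ell=0}^{k(n-k)} \binom{k(n-k)}{\ell} \left(\frac{p\, e^{-\mathrm{KL}(0.5\|\theta)}}{1-p}\right)^{\ell} \\
&\overset{(b)}{=} (1-p)^{k(n-k)} \left(1 + \frac{p\, e^{-\mathrm{KL}(0.5\|\theta)}}{1-p}\right)^{k(n-k)} \\
&= \left(1 - p\left(1 - e^{-\mathrm{KL}(0.5\|\theta)}\right)\right)^{k(n-k)}
\end{aligned}$$

where (a) comes from $P(M = \ell | x = 0) = \binom{k(n-k)}{\ell} p^{\ell} (1-p)^{k(n-k)-\ell}$
and the Chernoff bound (3.34); and (b) is due to the binomial theorem:
$\sum_{k=0}^{n} \binom{n}{k} x^k = (1+x)^n$.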

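The final step of the achievability proof Putting the above into (3.31), we get:

$$\begin{aligned}
P_e &\leq \sum_{k=1}^{n-1} \binom{n}{k} P(\hat{x}_{\mathrm{ML}} = a_k | x = 0) \\
&\leq \sum_{k=1}^{n-1} \binom{n}{k} \left(1 - p\left(1 - e^{-\mathrm{KL}(0.5\|\theta)}\right)\right)^{k(n-k)}.
\end{aligned}$$

Remember what we proved in the noiseless case (check the precise statement in
Prob 7.3):

$$\sum_{k=1}^{n-1} \binom{n}{k} (1-q)^{k(n-k)} \to 0 \quad \text{if } q > \frac{\ln n}{n}.$$

Hence, by replacing $q$ with $p(1 - e^{-\mathrm{KL}(0.5\|\theta)})$ in the above, one can make $P_e$
arbitrarily close to 0, provided that

$$p\left(1 - e^{-\mathrm{KL}(0.5\|\theta)}\right) > \frac{\ln n}{n} \iff p > \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}}.$$

This completes the achievability proof.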
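
As a numerical companion to the argument above, the following sketch evaluates the union bound on $P_e$ for an observation probability set to a multiple of the claimed limit. Everything here is an illustrative assumption of ours: the function names, the multiple `lam = 2`, the value $\theta = 0.1$, and the chosen values of $n$. The sum is computed in log space to avoid overflow in the binomial coefficients.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def union_bound(n, p, theta):
    """sum_{k=1}^{n-1} C(n,k) (1 - p(1 - e^{-KL(0.5||theta)}))^{k(n-k)}."""
    e_minus_kl = math.sqrt(4.0 * theta * (1.0 - theta))  # equals e^{-KL(0.5||theta)}
    q = p * (1.0 - e_minus_kl)
    log_base = math.log1p(-q)                            # ln(1 - q)
    return sum(math.exp(log_binom(n, k) + k * (n - k) * log_base)
               for k in range(1, n))


def p_at(n, theta, lam):
    """Observation probability equal to lam times the claimed limit p*."""
    return lam * math.log(n) / n / (1.0 - math.sqrt(4.0 * theta * (1.0 - theta)))


theta = 0.1  # hypothetical flipping probability
for n in (200, 2000):
    print(n, union_bound(n, p_at(n, theta, lam=2.0), theta))
```

With `lam = 2` the printed bound shrinks as $n$ grows, in line with the claim that $P_e$ can be made arbitrarily small above the limit.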


Look ahead We have proved the achievability of the limit in the noisy observation
model: $p^* = \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}}$. In the next section, we will prove the converse.

3.7 DNA Sequencing: Converse Proof


Recap We proved the achievability of the limit on observation probability $p$ in a
noisy community detection problem:

$$p > \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}} \implies P_e \to 0 \text{ as } n \to \infty.$$


Outline In this section, we will prove the converse:

$$p < \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-\mathrm{KL}(0.5\|\theta)}} \implies P_e \nrightarrow 0.$$

Proof strategy In the noiseless case, the converse proof is based on graph
connectivity:

$$\text{graph is connected} \implies P_e \to 0.$$

Hence, we focused on checking whether $P(\text{connected})$ converges to 1 depending
on conditions of observation probability. In the noisy case, however, checking graph
connectivity is not sufficient because the event of the graph being connected does not
necessarily imply reliable detection:

$$\text{graph is connected} \;\not\!\!\!\implies P_e \to 0.$$

So we will start from scratch. Starting with the definition of the probability of
error, we get:

$$\begin{aligned}
P_e &:= 1 - P(\text{success}) \\
&\overset{(a)}{=} 1 - P(\{\hat{x} = 0\} \cup \{\hat{x} = 1\} | x = 0) \\
&\overset{(b)}{=} 1 - P(\hat{x} = 0 | x = 0) - P(\hat{x} = 1 | x = 0) \\
&\overset{(c)}{=} 1 - 2P(\hat{x} = 0 | x = 0)
\end{aligned}$$

where (a) is by symmetry; (b) follows from the fact that the two events are disjoint;
and (c) is due to the fact that $\{\hat{x} = 0\}$ and $\{\hat{x} = 1\}$ are equally likely (there is no
way to disambiguate $x$ and $x \oplus 1$ from pairwise comparisons; the only thing that we
can do in this case is to flip a fair coin). Hence, it suffices to show that

$$p < \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-D^*}} \implies P(\hat{x} = 0 | x = 0) \to 0 \text{ as } n \to \infty. \tag{3.35}$$

Here we define $D^* := \mathrm{KL}(0.5\|\theta)$ for notational simplicity.

An upper bound on $P(\hat{x} = 0 | x = 0)$ In the converse proof, one cannot make
any assumption on the decoder type. This is because we wish to come up with a
necessary condition that holds under any arbitrary decoder. However, there is one
exceptional case where one can make an assumption. That is the case in which the
optimal decoder is employed. Notice that a necessary condition w.r.t. the optimal
decoder also holds for any other decoder. Let $P_e(\text{opt})$ and $P_e(\text{a decoder})$ be error
probabilities w.r.t. the optimal decoder and a particular decoder, respectively. Then,
by the definition of optimality:

$$P_e(\text{a decoder}) \geq P_e(\text{opt}).$$

We see that $P_e(\text{opt}) \nrightarrow 0$ implies $P_e(\text{a decoder}) \nrightarrow 0$. Hence, it suffices to show
that $P_e \nrightarrow 0$ under the optimal decoder. This allows us to assume the use of the
optimal decoder:

$$\hat{x} = \arg\min_{x} d(X(x), Y)$$

where $d(\cdot, \cdot)$ indicates the Hamming distance. In the noisy observation model, the
optimal decoder minimizes the Hamming distance; see (3.30). For notational
simplicity, define $d(x) := d(X(x), Y)$.

The event of interest $\{\hat{x} = 0\}$ implies that $\{d(a) \geq d(0)\}$, $\forall a \neq 0$. This
together with the fact that $P(A \cap B) \leq P(A)$ for any two events $(A, B)$ gives:

$$\begin{aligned}
P(\hat{x} = 0 | x = 0) &\leq P\left(\bigcap_{a \neq 0} \{d(a) \geq d(0)\} \,\Big|\, x = 0\right) \\
&\leq P\left(\bigcap_{a \in A_1} \{d(a) \geq d(0)\} \,\Big|\, x = 0\right)
\end{aligned} \tag{3.36}$$

where $A_1 = \{a : \|a\|_1 = 1, a_i \in \{0, 1\}\}$.

Remember the two key observations that we made in the noiseless case: (1) the
calculation of the above probability can be greatly simplified when the associated
events are independent; and (2) the more independent events there are, the tighter the
bound we get. This again motivates us to search for as many independent events
(among the $n$ events) as possible.

As in the noiseless case, the number of independent events is the same as the
number of locally disconnected nodes. To see this, refer to Fig. 3.20.

Like the noiseless case, consider an event $T$ in which there are at least $\lfloor \frac{n}{\ln n} \rfloor$
nodes which are locally disconnected, i.e., $y_{ij} = e$ for $i, j \in \{1, 2, \ldots, \lfloor \frac{n}{\ln n} \rfloor\}$:

$$T := \left\{ L \geq \left\lfloor \frac{n}{\ln n} \right\rfloor \right\}$$

where $L$ indicates the number of locally disconnected nodes. We assume that the
locally disconnected nodes are numbered as $1, 2, \ldots, \lfloor \frac{n}{\ln n} \rfloor$.


Figure 3.20. Independent events vs. locally disconnected nodes.


Given $T$, the event $\{d(10 \cdots 0) \geq d(0) | x = 0\}$ is a sole function of $y_{1j}$'s
(marked in light blue in the figure) where $j \in \{\lfloor \frac{n}{\ln n} \rfloor + 1, \ldots, n\}$. Notice that
$(1, j)$'s indicate distinguishable positions where the corresponding codeword entries
differ w.r.t. the messages $(10 \cdots 0)$ and $0$. Hence, the difference between Hamming
distances, $d(10 \cdots 0) - d(0)$, depends only on those positions. Similarly the
event $\{d(010 \cdots 0) \geq d(0) | x = 0\}$ is a sole function of $y_{2j}$'s (marked in purple).
Hence, the two events are independent since $y_{1j}$'s and $y_{2j}$'s are disjoint. Extending
this argument to other events, given $(T, x = 0)$, we obtain:

$$\{d(10 \cdots 0) \geq d(0)\} \perp \{d(010 \cdots 0) \geq d(0)\} \perp \cdots \perp \{d(0 \cdots 0 1 0 \cdots 0) \geq d(0)\} \tag{3.37}$$

where, in the last event, the single 1 is at the $\lfloor \frac{n}{\ln n} \rfloor$-th position.

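As was shown in Section 3.3, the event $T$ occurs w.h.p. in the regime $p = \lambda \frac{\ln n}{n}$
of interest, where $\lambda$ is the prefactor: $P(T) \to 1$ as $n \to \infty$. Applying this
to (3.36), we get:

$$\begin{aligned}
P(\hat{x} = 0 | x = 0) &\leq P\left(\bigcap_{a \in A_1} \{d(a) \geq d(0)\} \,\Big|\, x = 0\right) \\
&= P\left(\bigcap_{a \in A_1} \{d(a) \geq d(0)\}, T \,\Big|\, x = 0\right)
 + P\left(\bigcap_{a \in A_1} \{d(a) \geq d(0)\}, T^c \,\Big|\, x = 0\right) \\
&\overset{(a)}{\approx} P\left(\bigcap_{a \in A_1} \{d(a) \geq d(0)\} \,\Big|\, x = 0, T\right) \\
&\overset{(b)}{\leq} \prod_{a \in B_1} P(d(a) \geq d(0) | x = 0, T) \\
&\overset{(c)}{=} P(d(10 \cdots 0) \geq d(0) | x = 0, T)^{\lfloor \frac{n}{\ln n} \rfloor} \\
&= \{1 - P(d(10 \cdots 0) < d(0) | x = 0, T)\}^{\lfloor \frac{n}{\ln n} \rfloor}
\end{aligned} \tag{3.38}$$

where (a) follows from $P(T) \to 1$ and $P(T^c) \to 0$; (b) comes from (3.37) and
the definition $B_1 := \{b : \|b\|_1 = 1, \ b_i = 1 \text{ for some } i \in \{1, 2, \ldots, \lfloor \frac{n}{\ln n} \rfloor\}\}$; and (c) is due to
symmetry.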


A lower bound on $P(d(10 \cdots 0) < d(0) | x = 0, T)$ The last equation in (3.38)
motivates us to explore a lower bound on $P(d(10 \cdots 0) < d(0) | x = 0, T)$, as it
yields an upper bound on $P(\hat{x} = 0 | x = 0)$.

Given $T$, only the last $\tilde{n} := n - \lfloor \frac{n}{\ln n} \rfloor$ positions in the first row of $Y$ are
distinguishable between messages $0$ and $(10 \cdots 0)$. The event $\{d(10 \cdots 0) < d(0)\}$
depends solely on those positions, which in turn means that the number of 1's is
greater than the number of 0's in the distinguishable positions. See Fig. 3.21.


Figure 3.21. An equivalent condition. (Panels: $X(00 \cdots 0)$; $X(10 \cdots 0)$; $Y$ given $T$.)


Let $M$ be the number of observations in the distinguishable positions. Then,
we get:

$$\begin{aligned}
P(d(10 \cdots 0) < d(0) | x = 0, T)
&= P(\#1\text{'s} > \#0\text{'s in the distinguishable positions} | x = 0, T) \\
&\overset{(a)}{=} \sum_{\ell=0}^{\tilde{n}} P(M = \ell | x = 0, T) \\
&\qquad \times P(\#1\text{'s} > \#0\text{'s in distinguishable positions} | x = 0, T, M = \ell) \\
&\overset{(b)}{=} \sum_{\ell=0}^{\tilde{n}} \binom{\tilde{n}}{\ell} p^{\ell} (1-p)^{\tilde{n}-\ell}\, P\left(Z_1 + Z_2 + \cdots + Z_\ell > \frac{\ell}{2}\right)
\end{aligned} \tag{3.39}$$

where (a) is due to the total probability law; and (b) comes from $P(M = \ell | x =
0, T) = \binom{\tilde{n}}{\ell} p^{\ell} (1-p)^{\tilde{n}-\ell}$. We define $Z_i$ as the measured value at the $i$th observed
entry in the $\tilde{n}$ distinguishable positions. Then, $Z_i$'s are i.i.d. $\sim$ Bern($\theta$) where $i \in
\{1, \ldots, \ell\}$.
The Chernoff bound (see Prob 4.4) provides an upper bound on $P(Z_1 + Z_2 + \cdots + Z_\ell > \frac{\ell}{2})$.
What we are interested in is a lower bound though. It turns out that
the probability of interest has the same order as the upper bound $e^{-\ell D^*}$. More concretely,
given $\epsilon > 0$, there exists $n_0$ such that for $\ell \geq n_0$:

$$
P\!\left(Z_1 + Z_2 + \cdots + Z_\ell > \frac{\ell}{2}\right) \geq e^{-(1+\epsilon)\ell D^*}.
$$

Here $n_0 = \ln \tilde{n}$ is one such choice under which the above holds. Check this in
Prob 8.2. This together with (3.39) gives:


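$$
\begin{aligned}
P(d(10\cdots0) < d(0) \mid x = 0, T)
&\geq \sum_{\ell=n_0}^{\tilde{n}} \binom{\tilde{n}}{\ell} p^{\ell}(1-p)^{\tilde{n}-\ell} e^{-(1+\epsilon)\ell D^*} \\
&\overset{(a)}{\geq} (1-\delta_n) \sum_{\ell=0}^{\tilde{n}} \binom{\tilde{n}}{\ell} p^{\ell}(1-p)^{\tilde{n}-\ell} e^{-(1+\epsilon)\ell D^*} \\
&\overset{(b)}{\approx} (1-p)^{\tilde{n}} \sum_{\ell=0}^{\tilde{n}} \binom{\tilde{n}}{\ell} \left(\frac{p\,e^{-(1+\epsilon)D^*}}{1-p}\right)^{\ell} \\
&\overset{(c)}{=} (1-p)^{\tilde{n}} \left(1 + \frac{p\,e^{-(1+\epsilon)D^*}}{1-p}\right)^{\tilde{n}} \\
&= \left(1 - p\,(1 - e^{-(1+\epsilon)D^*})\right)^{\tilde{n}} \\
&\geq \left(1 - p\,(1 - e^{-(1+\epsilon)D^*})\right)^{n}.
\end{aligned}
\tag{3.40}
$$

The step (a) is due to the fact that the summation up to $n_0$ is negligible relative
to the entire sum. Precisely speaking, there exists $\delta_n \to 0$ such that the step (a)
holds. Check this in Prob 8.2. The step (b) comes from $\delta_n \to 0$ and (c) is due to
the binomial theorem.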
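The two ingredients used above can be checked numerically. The sketch below is a sanity check, not part of the proof. It evaluates the exact tail probability $P(Z_1 + \cdots + Z_\ell > \ell/2)$ for $Z_i$ i.i.d. $\sim$ Bern($\theta$), compares its decay rate with the Chernoff information $D^* = -\ln(2\sqrt{\theta(1-\theta)})$ between Bern($\theta$) and Bern($1-\theta$), and verifies the binomial-theorem identity behind step (c). The values $\theta = 0.1$, $\tilde{n} = 500$, $p = 0.02$ and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom

theta = 0.1  # flipping error rate (illustrative value)
# Chernoff information between Bern(theta) and Bern(1 - theta)
D_star = -np.log(2 * np.sqrt(theta * (1 - theta)))

def log_tail(ell, theta):
    """ln P(Z_1 + ... + Z_ell > ell/2) for Z_i i.i.d. ~ Bern(theta)."""
    # For an integer-valued sum S, {S > ell/2} equals {S > floor(ell/2)}.
    return binom.logsf(ell // 2, ell, theta)

def exponent(ell, theta):
    """Decay rate -ln P(...) / ell of the tail probability."""
    return -log_tail(ell, theta) / ell

def binomial_sum(n_tilde, p, c):
    """sum_l C(n,l) p^l (1-p)^(n-l) e^(-c l), evaluated term by term."""
    ell = np.arange(n_tilde + 1)
    return np.sum(binom.pmf(ell, n_tilde, p) * np.exp(-c * ell))

def closed_form(n_tilde, p, c):
    """(1 - p (1 - e^(-c)))^n, the closed form from the binomial theorem."""
    return (1 - p * (1 - np.exp(-c))) ** n_tilde

# The decay rate stays above D_star and moves toward it as ell grows.
for ell in (10, 100, 1000):
    print(ell, exponent(ell, theta), D_star)

# Step (c): the term-by-term sum agrees with the closed form.
print(binomial_sum(500, 0.02, D_star), closed_form(500, 0.02, D_star))
```

The gap between the printed decay rate and $D^*$ shrinking with $\ell$ is the numerical counterpart of the $(1+\epsilon)$ factor in the lower bound.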


The final step We are ready to complete the proof. Plugging (3.40) into (3.38),
we get:


PÈ = O|x = 0) 
S {1-P(d(10---0) < d(0)|x = 0, T)} Linn! 
nieder 
2 exp f-0 -20 - tty 27} 


Inn 
(6) — pple UtOD*) an 
a exp |—e j Tl 


= exp Eaa ln n—ln na 

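where (a) is due to the fact that $1 - x \leq e^{-x}$ for $x \geq 0$; and (b) comes from the fact
that $\left(1 - p\,(1 - e^{-(1+\epsilon)D^*})\right)^{n} \approx e^{-p(1 - e^{-(1+\epsilon)D^*})n}$ for sufficiently large $n$ and small
$p = \lambda \frac{\ln n}{n}$ (which is our case). Therefore, if $\lambda < \frac{1}{1 - e^{-(1+\epsilon)D^*}}$, the upper bound goes
to 0 as $n \to \infty$. This completes the proof of (3.35).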


Look ahead We have proven the limit on $p$ for the noisy community detection:

$$
p > \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-D^*}} \iff P_e \to 0.
$$

Our achievability is based on the ML decoder, which suffers from high computational
complexity. In the next section, we will explore efficient algorithms that
potentially achieve the same optimal performance as the ML rule.
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3.8 DNA Sequencing: Algorithm and Python 
Implementation 


Recap We have proved the achievability and converse of the limit in the noisy
observation model (inspired by Haplotype phasing):

$$
p^* = \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-D^*}} = \frac{\ln n}{n} \cdot \frac{1}{1 - \sqrt{4\theta(1-\theta)}}
$$

where $\theta$ is the flipping error rate. Since the optimal ML decoding rule comes with a
challenge in computational complexity (as in the noiseless case), it is important to
develop computationally efficient algorithms that achieve the optimal ML performance
yet with much lower complexity. Even in the noisy setting, such efficient
algorithms have already been developed.
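To get a feel for the numbers, the limit can be evaluated directly. The snippet below is a small sketch; the function name `p_star` is a hypothetical helper, and the values $n = 4000$, $\theta = 0.1$ are the ones used in the simulations of this section.

```python
import numpy as np

def p_star(n, theta):
    """Limit (ln n / n) / (1 - sqrt(4 * theta * (1 - theta)))."""
    return np.log(n) / n / (1 - np.sqrt(4 * theta * (1 - theta)))

print(p_star(4000, 0.1))   # roughly 5.18e-3
print(p_star(4000, 0.0))   # with theta = 0 the formula reduces to ln n / n
```

With $\theta = 0.1$ the factor $1/(1 - \sqrt{4\theta(1-\theta)})$ equals $2.5$, so the noisy setting needs an observation probability 2.5 times that of the formula's $\theta = 0$ value.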


Outline In this section, we will investigate two efficient algorithms for Haplotype 
phasing in the presence of noise. The first method is the same as the one we used 
in the noiseless scenario, which is the spectral algorithm. The second algorithm is 
a slightly more complex version that involves obtaining an initial estimate through 
the spectral algorithm and refining it with an additional operation to improve per- 
formance (Chen et al., 2016a). This second algorithm is not only still efficient 
but also optimal. However, as the focus of this book is not on optimality proofs, 
we will concentrate on explaining how the algorithms work and providing their 
Python implementations. This section consists of four parts. First, we will review 
the spectral algorithm. Second, we will apply it to the noisy observation setting 
and evaluate its performance through Python simulation. Third, we will describe 
how the second algorithm works. Finally, we will implement the second algorithm 
with the additional operation in Python to demonstrate that it outperforms the 
first spectral algorithm. 


Review of the spectral algorithm The spectral algorithm is based on the adjacency
matrix $A \in \mathbb{R}^{n \times n}$ where each entry $a_{ij}$ indicates whether users $i$ and $j$ are in
the same community:

$$
a_{ij} =
\begin{cases}
1 - 2y_{ij}, & \text{w.p. } p; \\
0, & \text{w.p. } 1 - p \ (y_{ij} = e)
\end{cases}
\tag{3.41}
$$

where $y_{ij} = x_i \oplus x_j \oplus z_{ij}$ and $z_{ij}$'s are i.i.d. $\sim$ Bern($\theta$). Here $a_{ij} = 0$ denotes no
measurement. In the noiseless case, $a_{ij} = +1$ means that the two users are in the
same community; and $a_{ij} = -1$ indicates the opposite.

Inspired by the fact that the rank of $A$ is 1 under the ideal situation (the noiseless
full measurement setting) and the principal eigenvector matches the ground-truth
community vector (up to a global sign flip), the spectral algorithm computes the principal
eigenvector and declares its thresholded version as an estimate. For computational
efficiency, it employs the power method for computing the principal eigenvector
(instead of eigenvalue decomposition). Here is the summary of how it works.
1. Construct the adjacency matrix $A$ as per (3.41).
2. Choose a random vector $v$ and set $v^{(0)} = v$ and $t = 0$.
3. $v^{(t+1)} = \frac{A v^{(t)}}{\| A v^{(t)} \|_2}$ and increase $t$ by 1.
4. Iterate Step 3 until converged, e.g., $\| v^{(t)} - v^{(t-1)} \|_2 < \epsilon = 10^{-5}$.
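The rank-one fact that motivates the steps above can be checked on a tiny example. The sketch below assumes the noiseless, fully observed setting, so that $a_{ij} = 1 - 2(x_i \oplus x_j)$. It uses a full eigendecomposition (`np.linalg.eigh`) only because the example is small; this is a substitute for the power method, not the algorithm described above. The size $n = 8$ and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.integers(0, 2, size=n)       # community memberships in {0, 1}
s = 1 - 2 * x                        # the same memberships written as +1 / -1

# Noiseless, fully observed case: a_ij = 1 - 2 (x_i XOR x_j) = s_i * s_j
A = 1 - 2 * ((x[:, None] + x[None, :]) % 2)

rank = np.linalg.matrix_rank(A)
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
v = eigvecs[:, -1]                    # principal eigenvector

# Threshold, then resolve the global sign ambiguity via the first coordinate
estimate = np.sign(v)
estimate = estimate * estimate[0] * s[0]

print(rank, eigvals[-1])
print(estimate.astype(int), s)
```

Since $A = s s^{T}$ in this idealized case, the only nonzero eigenvalue is $n$ and the thresholded eigenvector recovers $s$ up to sign.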


Python implementation of the spectral algorithm We implement the spectral
algorithm via Python. The code below is almost the same as in the noiseless
setting. The only distinction is that $y_{ij}$ is a noisy version of $x_i \oplus x_j$. We consider a
setting where the flipping error rate $\theta = 0.1$ and $n = 4000$.


from scipy.stats import bernoulli
from scipy.stats import pearsonr
import numpy as np
import matplotlib.pyplot as plt

n = 4000  # number of users
Bern = bernoulli(0.5)
# Generate n community memberships
x = Bern.rvs(n)

# Construct the codebook
X = np.zeros((n,n))
for i in range(len(x)):
    for j in range(i,len(x)):
        # Compute xij = xi + xj (modulo 2)
        X[i,j] = (x[i]+x[j]) % 2
        # Symmetric component
        X[j,i] = X[i,j]

# Construct an observation matrix
theta = 0.1  # noise flipping error rate
noise_Bern = bernoulli(theta)
noise_matrix = noise_Bern.rvs((n,n))
Y = (X + noise_matrix)%2

p = np.linspace(0.001,0.0065,30)
limit = 1/(1-np.sqrt(4*theta*(1-theta)))*np.log(n)/n
p_norm = p/limit

def power_method(A, eps=1e-5):
    # A computationally efficient algorithm
    # for finding the principal eigenvector
    # Choose a random vector
    v = np.random.randn(n)
    # normalization
    v = v/np.linalg.norm(v)
    prev_v = np.zeros(len(v))
    t = 0
    while np.linalg.norm(prev_v-v) > eps:
        prev_v = v
        v = np.array(np.dot(A,v)).reshape(-1)
        v = v/np.linalg.norm(v)
        t += 1
    print("Terminated after %s iterations"%t)
    return v

corr = np.zeros_like(p)

for i,val in enumerate(p):
    obs_bern = bernoulli(val)
    # Construct an n-by-n mask matrix:
    # entry = 1 (observed); 0 (otherwise)
    mask_matrix = obs_bern.rvs((n,n))
    # Construct the adjacency matrix
    A = (1-2*Y)*mask_matrix
    # Power method
    v1 = power_method(A)
    # Threshold the principal eigenvector
    v1 = np.sign(v1)
    # Compute the ground truth
    ground_truth = 1-2*x
    # Compute Pearson correlation
    corr[i] = np.abs(pearsonr(ground_truth,v1)[0])
    print(p_norm[i], corr[i])

plt.figure(figsize=(5,5), dpi=200)
plt.plot(p_norm, corr)
plt.title('Pearson correlation btw estimate and ground truth')
plt.grid(linestyle=':', linewidth=0.5)
plt.show()


As depicted in Fig. 3.22, the Pearson correlation approaches 1 as $p$ approaches the
limit $p^*$. As previously noted, there exists an alternative method that surpasses the
performance of the spectral algorithm. Let us now examine this algorithm in further
detail.


Figure 3.22. Pearson correlation between the ground-truth community vector and the
estimate obtained from the spectral algorithm: $n = 4000$ and $\theta = 0.1$.


Additional step: Local refinement The other algorithm takes two steps: (i)
running the spectral algorithm; and then (ii) performing an additional step called
local refinement. The role of the second step is to detect and correct any errors, assuming that
a majority of the components in the estimate vector are correct. The idea of local
refinement is to use the coordinate-wise maximum likelihood estimator (MLE). Here
is how it works. Suppose we pick a user, say user $i$. We then compute the
coordinate-wise likelihood w.r.t. the membership value of user $i$. However, computing
the coordinate-wise likelihood presents a challenge as it requires knowledge
of the ground-truth memberships of other users, which are not revealed. As a surrogate,
we employ an initial estimate, say $x^{(0)} = [x_1^{(0)}, \ldots, x_n^{(0)}]$, obtained in the
earlier step. Specifically, we compute:

$$
x_i^{\mathrm{MLE}} = \arg\max_{a \in \{x_i^{(0)},\, x_i^{(0)} \oplus 1\}}
P\!\left(Y \mid x_1^{(0)}, \ldots, x_{i-1}^{(0)}, a, x_{i+1}^{(0)}, \ldots, x_n^{(0)}\right).
$$




To find $x_i^{\mathrm{MLE}}$, we only need to compare the two likelihood functions w.r.t. $a = x_i^{(0)}$
and $a = x_i^{(0)} \oplus 1$. It boils down to comparing the following two:

$$
\sum_{j:(i,j) \in \Omega} y_{ij} \oplus x_i^{(0)} \oplus x_j^{(0)}
\quad \text{vs} \quad
\sum_{j:(i,j) \in \Omega} y_{ij} \oplus x_i^{(0)} \oplus 1 \oplus x_j^{(0)}
\tag{3.42}
$$

where $\Omega$ indicates the set of $(i, j)$ pairs such that $y_{ij} \neq e$. In order to gain
insight into the above two terms, consider an idealistic setting where $x^{(0)}$ matches
the ground truth. In this case, the two terms become:

$$
\sum_{j:(i,j) \in \Omega} z_{ij}
\quad \text{vs} \quad
\sum_{j:(i,j) \in \Omega} z_{ij} \oplus 1.
\tag{3.43}
$$

Since the noise flipping error rate is assumed to be less than 0.5, the first term is
likely to be smaller than the second. Hence, it would be reasonable to take the candidate
that yields a smaller value among the two. This is exactly what the coordinate-wise
MLE does:

$$
x_i^{\mathrm{MLE}} =
\begin{cases}
x_i^{(0)}, & \sum_{j:(i,j) \in \Omega} y_{ij} \oplus x_i^{(0)} \oplus x_j^{(0)} \leq \sum_{j:(i,j) \in \Omega} y_{ij} \oplus x_i^{(0)} \oplus 1 \oplus x_j^{(0)}; \\
x_i^{(0)} \oplus 1, & \text{otherwise.}
\end{cases}
$$

We follow the same procedure for all other users, performing the local refinement
step user by user. This forms one iteration to yield $x^{(1)} =
[x_1^{\mathrm{MLE}}, \ldots, x_n^{\mathrm{MLE}}]$. It has been shown in (Chen et al., 2016a) that with multiple
iterations (around the order of $\ln n$ iterations), the coordinate-wise MLE converges to
the ground-truth community vector. We will not provide a proof of this fact, but
instead, we will present simulation results that demonstrate the improved performance
offered by local refinement. Below is the code for local refinement:


# initial estimate obtained from the spectral algorithm
x0 = (1 - v1)//2
xt = x0

ITER = 3  # number of refinement iterations
for t in range(ITER):
    # copy, so that every update in this pass uses the previous iterate x^(t)
    xt1 = xt.copy()
    for i in range(len(xt)):
        # Likelihood w.r.t. x_i^(t)
        L1 = (Y[i,:] + xt[i] + xt)%2
        L1 = L1*mask_matrix[i,:]
        # Likelihood w.r.t. x_i^(t)+1
        L2 = (Y[i,:] + xt[i] + 1 + xt)%2
        L2 = L2*mask_matrix[i,:]
        xt1[i] = xt[i]*(sum(L1) <= sum(L2)) \
                 + ((xt[i]+1)%2)*(sum(L1) > sum(L2))
    xt = xt1
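To see a single refinement pass in isolation, the following self-contained sketch applies the coordinate-wise rule on a small synthetic instance. It is a vectorized illustration rather than the listing above: the initial estimate is the ground truth with 30 entries deliberately flipped (standing in for an imperfect spectral estimate), and the values $n = 300$, $p = 0.3$, $\theta = 0.1$ as well as the function name `refine_once` are assumptions made for the example.

```python
import numpy as np

def refine_once(Y, mask, x_est):
    """One pass of the coordinate-wise rule, applied to all users at once."""
    x_est = x_est.astype(int)
    # disagree[i, j] = y_ij XOR x_i XOR x_j (first term in (3.42))
    disagree = (Y + x_est[:, None] + x_est[None, :]) % 2
    keep_cost = (disagree * mask).sum(axis=1)        # cost of keeping x_i
    flip_cost = ((1 - disagree) * mask).sum(axis=1)  # cost of flipping x_i
    return np.where(keep_cost <= flip_cost, x_est, 1 - x_est)

rng = np.random.default_rng(0)
n, p, theta = 300, 0.3, 0.1

x = rng.integers(0, 2, size=n)                    # ground-truth memberships
X = (x[:, None] + x[None, :]) % 2                 # x_i XOR x_j
Z = (rng.random((n, n)) < theta).astype(int)      # Bern(theta) flips
Y = (X + Z) % 2                                   # noisy measurements
mask = (rng.random((n, n)) < p).astype(int)       # 1 = observed

# Imperfect initial estimate: ground truth with 30 entries flipped
x0 = x.copy()
flipped = rng.choice(n, size=30, replace=False)
x0[flipped] ^= 1

x1 = refine_once(Y, mask, x0)
print("errors before:", int((x0 != x).sum()), "after:", int((x1 != x).sum()))
```

Because most entries of the initial estimate are correct, each user's observed row votes overwhelmingly for the right membership, which is the intuition behind (3.43).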


Performances of the spectral algorithm vs local refinement We compare
the Pearson correlations of the spectral algorithm and the two-step approach with
local refinement, for a setting where $n = 4000$ and $\theta = 0.1$. We set the number of
iterations in the local refinement step as 9, since it is close to the suggested number
($\ln 4000 \approx 8.294$). Here is the code for the simulation.


from scipy.stats import bernoulli
from scipy.stats import pearsonr
import numpy as np
import matplotlib.pyplot as plt

n = 4000  # number of users
Bern = bernoulli(0.5)
# Generate n community memberships
x = Bern.rvs(n)

# Construct the codebook
X = np.zeros((n,n))
for i in range(len(x)):
    for j in range(i,len(x)):
        # Compute xij = xi + xj (modulo 2)
        X[i,j] = (x[i]+x[j]) % 2
        # Symmetric component
        X[j,i] = X[i,j]

# Construct an observation matrix
theta = 0.1  # noise flipping error rate
noise_Bern = bernoulli(theta)
noise_matrix = noise_Bern.rvs((n,n))
Y = (X + noise_matrix)%2

p = np.linspace(0.001,0.0065,30)
limit = 1/(1-np.sqrt(4*theta*(1-theta)))*np.log(n)/n
p_norm = p/limit

corr1 = np.zeros_like(p)
corr2 = np.zeros_like(p)

# power_method is the function defined in the spectral algorithm listing
for i, val in enumerate(p):
    obs_bern = bernoulli(val)
    # Construct an n-by-n mask matrix:
    # entry = 1 (observed); 0 (otherwise)
    mask_matrix = obs_bern.rvs((n,n))

    ###################################
    ####### Spectral algorithm ########
    ###################################
    # Construct the adjacency matrix
    A = (1-2*Y)*mask_matrix
    # Power method
    v1 = power_method(A)
    # Threshold the principal eigenvector
    v1 = np.sign(v1)

    ###################################
    ######## Local refinement #########
    ###################################
    # initial estimate (from the spectral algorithm)
    x0 = (1 - v1)//2
    xt = x0
    # number of iterations
    ITER = 9
    for t in range(ITER):
        # copy, so that every update in this pass uses the previous iterate
        xt1 = xt.copy()
        # coordinate-wise MLE
        for k in range(len(xt)):
            # Likelihood w.r.t. x_k^(t)
            L1 = (Y[k,:] + xt[k] + xt)%2
            L1 = L1*mask_matrix[k,:]
            # Likelihood w.r.t. x_k^(t)+1
            L2 = (Y[k,:] + xt[k] + 1 + xt)%2
            L2 = L2*mask_matrix[k,:]
            xt1[k] = xt[k]*(sum(L1) <= sum(L2)) \
                     + ((xt[k]+1)%2)*(sum(L1) > sum(L2))
        xt = xt1

    # Compute the ground truth
    ground_truth = 1-2*x
    # Compute Pearson correlation
    corr1[i] = np.abs(pearsonr(ground_truth,v1)[0])
    corr2[i] = np.abs(pearsonr(ground_truth,1-2*xt)[0])
    print(p_norm[i], corr1[i], corr2[i])

plt.figure(figsize=(5,5), dpi=200)
plt.plot(p_norm, corr1, label='spectral algorithm')
plt.plot(p_norm, corr2, label='local refinement')
plt.title('Pearson correlation btw estimate and ground truth')
plt.legend()
plt.grid(linestyle=':', linewidth=0.5)
plt.show()


Figure 3.23. Pearson correlation performances of the spectral algorithm and local refinement:
$n = 4000$ and $\theta = 0.1$. (Curves: spectral algorithm, local refinement; horizontal axis: $p$ normalized by $\frac{\ln n}{n} \cdot \frac{1}{1 - \sqrt{4\theta(1-\theta)}}$.)


Fig. 3.23 demonstrates that the inclusion of the local refinement step leads to an
improvement in performance.


Look ahead We have investigated two data science applications that exhibit 
phase transitions. The subsequent section will delve into another data science appli- 
cation that involves phase transition. 
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Problem Set 8 


Prob 8.1 (Reverse Chernoff bound) Suppose we observe $n$ i.i.d. discrete random
variables $Y^n := (Y_1, \ldots, Y_n)$. Consider two hypotheses $H_0 : Y_i \sim P_0(y)$;
and $H_1 : Y_i \sim P_1(y)$ for $y \in \mathcal{Y}$ and $i \in \{1, \ldots, n\}$. Define the Chernoff information
$D^*$ as:

$$
D^* := -\min_{0 \leq \lambda \leq 1} \ln \sum_{y \in \mathcal{Y}} P_0(y)^{\lambda} P_1(y)^{1-\lambda}.
$$

Let $L_i$ be the likelihood function w.r.t. $H_i$:

$$
L_i = P(Y^n \mid H_i).
$$

Assume that $P_0 \sim$ Bern($\theta$) and $P_1 \sim$ Bern($1 - \theta$) for a fixed $\theta \in (0, \frac{1}{2})$.

(a) Compute $D^*$.
(b) Show that

$$
P(L_1 > L_0 \mid H_0) \leq e^{-nD^*}.
$$

(c) Fix $\epsilon > 0$. For sufficiently large $n$, show that

$$
P(L_1 > L_0 \mid H_0) \geq e^{-(1+\epsilon)nD^*}.
$$
Prob 8.2 (Useful bounds) Let $p = \lambda \frac{\ln n}{n}$ for some positive constant $\lambda$.
(a) Show that 
l 


nna—-l Inn 
> (o -pE < (1—p)"Inn (2) ; 
= € l—-p 


(6) Show that 


b3 (o —pyr =p) (: + 4) : 


€=0 


(c) Show that there exists €» > 0 such that €, —> 0 as n — oo and 


Z n ¢ — pyn-t E yA R ” 
Èh) =A > en-e) (+5); 
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Prob 8.3 (Generalized Chernoff bound) Suppose we observe 7 i.i.d. discrete 
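random variables $Y^n := (Y_1, \ldots, Y_n)$. Consider two hypotheses $H_0 : Y_i \sim P_0(y)$;
and $H_1 : Y_i \sim P_1(y)$ for $y \in \mathcal{Y}$. Define the Chernoff information as:

$$
D^* := -\min_{0 \leq \lambda \leq 1} \ln \sum_{y \in \mathcal{Y}} P_0(y)^{\lambda} P_1(y)^{1-\lambda}.
$$

Let $L_i$ be the likelihood function w.r.t. $H_i$: $L_i = P(Y^n \mid H_i)$.
(a) Show that for $a, b > 0$ and $\lambda \in [0, 1]$,

$$
\min\{a, b\} \leq a^{\lambda} b^{1-\lambda}.
$$

(b) Let $\mathcal{A} := \{y^n : P(y^n \mid H_1) > P(y^n \mid H_0)\}$. Show that

$$
P(L_1 > L_0 \mid H_0) = \sum_{y^n \in \mathcal{A}} \prod_{i=1}^{n} P_0(y_i).
$$

(c) Using parts (a) and (b), show that

$$
P(L_1 > L_0 \mid H_0) \leq e^{-nD^*}. \tag{3.44}
$$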


Prob 8.4 (A generalized model for community detection) Suppose there
are $n$ users clustered into two communities. Let $x_i \in \{0, 1\}$ indicate a community
membership with regard to user $i \in \{1, 2, \ldots, n\}$. Assume that we are given part of
the comparison pairs:

$$
y_{ij}
\begin{cases}
\sim P(y_{ij} \mid x_{ij}), & \text{w.p. } p; \\
= e, & \text{w.p. } 1 - p
\end{cases}
$$

for every pair $(i, j) \in \{(1, 2), (1, 3), \ldots, (n-1, n)\}$ and $p \in [0, 1]$. Whenever an
observation is made, $y_{ij}$ is generated as per:

$$
P(y_{ij} \mid x_{ij}) =
\begin{cases}
P_0(y_{ij}), & x_{ij} = 0; \\
P_1(y_{ij}), & x_{ij} = 1.
\end{cases}
$$

We also assume that $y_{ij}$'s are independent over $(i, j)$. Given $y_{ij}$'s, one wishes to
decode the community membership vector $x := [x_1, x_2, \ldots, x_n]$ or $x \oplus 1 :=
[x_1 \oplus 1, x_2 \oplus 1, \ldots, x_n \oplus 1]$. Let $\hat{x}$ be a decoded vector. Define the probability
of error as $P_e = P(\hat{x} \notin \{x, x \oplus 1\})$. Suppose we employ the ML decoding rule:

$$
\hat{x}_{\mathrm{ML}} = \arg\max_{x} L(x)
$$

where $L(x) := P(Y \mid X(x))$ indicates the likelihood function w.r.t. $x$.
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(a) Show that 


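where $a_k = [\underbrace{1, 1, \ldots, 1}_{k}, \underbrace{0, 0, \ldots, 0}_{n-k}]$.
(b) Let $M$ be the number of observations made in the distinguishable positions
between $X(a_k)$ and $X(0)$. Show that

$$
P(\hat{x}_{\mathrm{ML}} = a_k \mid x = 0) \leq \sum_{\ell=0}^{k(n-k)} \binom{k(n-k)}{\ell} p^{\ell} (1-p)^{k(n-k)-\ell}
\times P(L(a_k) > L(0) \mid x = 0, M = \ell).
$$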


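(c) Using the above and part (c) in Prob 8.3, show that

$$
P_e \leq \sum_{k=1}^{n-1} \binom{n}{k} \left(1 - p\,(1 - e^{-D^*})\right)^{k(n-k)}
$$

where $D^*$ denotes the Chernoff information:

$$
D^* := -\min_{0 \leq \lambda \leq 1} \ln \sum_{y \in \mathcal{Y}} P_0(y)^{\lambda} P_1(y)^{1-\lambda}.
$$

(d) Using the above and Prob 7.3, show that if $p > \frac{\ln n}{n} \cdot \frac{1}{1 - e^{-D^*}}$, $P_e$ can be
made arbitrarily close to 0 as $n \to \infty$.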


Prob 8.5 (True or False?) 


(a) Suppose that $X_1 \sim$ Bern($p$), $X_2 \sim$ Bern($\frac{1}{2}$) and $S = X_1 \oplus X_2$. Then, $S$
follows Bern($\frac{1}{2}$) for any $p \in [0, 1]$.

(b) Let $X = [X_1, X_2, \ldots, X_n]^T$ be an i.i.d. random vector, each being according
to Bern($\frac{1}{2}$). Let $A$ be an $n$-by-$n$ full-rank matrix with $A_{ij} \in \{0, 1\}$ entries.
Let $Y = AX$, i.e., $Y_i = \sum_{j=1}^{n} A_{ij} X_j$. Here the summation is modulo-2
addition. Then, $Y_i$'s are i.i.d.
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(c) Let X; and X be independent random variables, each being according to 
Bern(3). Suppose we observe 
(Xi $ Z|, X2 ® 22), w.p. | — a; 


(Y, V) = 
(16162,,%2161624), wpa 


where Z1 and Z are independent random variables ~ Bern(q), being also 
independent of (X1, X2). Then, Yı @ Y2 is a sufficient statistic w.r.t. (X1, X2). 
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3.9 Top-K Ranking: Fundamental Limits 


Recap Throughout the previous sections, we have examined two data science 
applications that incorporate information theory, wherein a precise threshold on 
the amount of information required to perform a specific task exists. Moreover, we 
have discovered that various bounding techniques covered in Parts I and II play 
a role in characterizing the fundamental limits of the explored problems, such as 
community detection and Haplotype phasing. 


Outline In the upcoming sections, we will explore another data science applica- 
tion where phase transition occurs, specifically in the context of rank aggregation. 
Our focus will be on a particular type of ranking problem, known as top-K rank- 
ing, which aims to address the computational challenge posed by the huge number 
of items to be ranked in the big data era. We will first provide an overview of the 
ranking problem and highlight the aforementioned challenge. Then, we will intro- 
duce top-K ranking and explain how it differs from traditional ranking methods. 
Subsequently, we will present a benchmark model that can be used to evaluate 
the performance of ranking algorithms. Finally, we will investigate the fundamen- 
tal limits of top-K ranking and show how phase transition occurs in terms of the 
amount of information required to achieve reliable ranking. 


Ranking Ranking refers to the process of arranging items in order of significance. 
One example of ranking is a competition where several candidates participate, and 
the judges have to rank them in order of quality to determine the winner(s). In 
the simplest scenario, if each candidate has a score that indicates their quality, and
those scores are known, then ranking is a straightforward task of sorting the candidates
according to their scores. One can employ one of the well-known sorting
algorithms (such as quicksort, bubble sort, merge sort, etc.¹) to obtain a ranking.
Numerous sorting algorithms are available, which can sort a large number of
items efficiently with a relatively small number of comparisons, typically scaling as
$n \ln n$. However, the challenge arises when we do not know the individual scores of
the candidates, which was assumed in the previous scenario. In reality, it is not that 
easy to assess the value of individuals. How can we assign an absolute value to people 
or items? It is nearly impossible. Therefore, in practical scenarios, individual scores 
are often not available. On the other hand, pairwise comparisons are relatively easy 
to obtain. For a given pair of candidates, it may be possible to determine which one 
is better than the other without having knowledge of their absolute qualities. 


1. These are very well known in the computer science literature. Readers who are not familiar with them may
refer to an introductory book on data structures and/or Wikipedia.
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Figure 3.24. Small-scale ranking based on pairwise comparisons. 


In situations where only pairwise comparisons are available, obtaining a ranking 
becomes a challenging task. One approach to tackle this problem is to aggregate 
all the comparison information to determine the order of all the candidates. By 
analyzing each comparison, one can determine which candidate is preferred over 
the other, and subsequently obtain a ranking. For instance, in Fig. 3.24, where 
the number of candidates is 7, this approach would require at most 21 ($= \binom{7}{2}$)
comparisons. Thus, it is evident that ranking is straightforward for small-scale 
problems. 


Large-scale ranking However, as the scale of the ranking problem increases, it 
becomes increasingly challenging. A prime example is web search, which is handled 
by search engines such as Google and Bing. The number of items (in this case, web- 
sites) to be ranked is enormous, with Google alone managing billions (on the order 
of $10^9$) of websites. When a user enters a query, the search engine must provide a
list of relevant websites, but the sheer volume of related websites poses a significant 
challenge. 


Challenge in large-scale ranking To illustrate the challenge, let us consider the 
naive ranking approach discussed earlier. If we were to use this approach to rank 
the websites in the web search example, we would require an enormous number of 
comparisons given that the number of websites is in the billions. Specifically, we 
would need a total of $\frac{n^2}{2}$ ($\approx \binom{n}{2}$) comparisons, which is an extremely large number,
around $10^{18}$ for the web search example. This means that ranking is no longer a
trivial task, and it raises the question of whether this number of comparisons is truly
necessary or if there exists a more efficient ranking algorithm. 
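The counts quoted above follow from $\binom{n}{2} = n(n-1)/2$. A quick check, using Python's standard `math.comb` (the helper name `num_pairs` is illustrative):

```python
import math

def num_pairs(n):
    """Number of distinct pairs among n items: C(n, 2) = n (n - 1) / 2."""
    return math.comb(n, 2)

print(num_pairs(7))        # 21
print(num_pairs(10**9))    # 499999999500000000, i.e. about 5e17
```

For a billion items, the exhaustive approach needs about $5 \times 10^{17}$ comparisons, which is the "around $10^{18}$" order of magnitude mentioned in the text.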
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Unfortunately, two conventional assumptions impose a fundamental lower 
bound of around $\frac{n^2}{2}$ comparisons, which cannot be beaten. The first assumption
is that we require a complete ordering of the items, while the second assumption 
is that pairwise comparisons are given probabilistically. This second assumption is 
relevant in many practical applications, such as web search where pairwise com- 
parisons can be formed from hyperlink information. Since we cannot control the 
existence of such information, it is reasonable to assume that it is given probabilistically.
Another example is Twitter's follower information: whether user A follows user
B is determined by context that we do not control. To capture this passively given
information, we can assume that pairwise samples are given in a random manner.

Assuming the two conventional assumptions mentioned earlier, obtaining a 
ranking becomes a huge challenge because it requires almost all comparisons, which 
is around $\frac{n^2}{2}$. This is due to the fact that in order to identify the order between two
consecutive candidates, a direct comparison between any two adjacent items is nec- 
essary. Therefore, all comparisons are required with probability 1, making the total 
number of required comparisons too large to handle, especially when dealing with 
a large scale ranking such as web search, where the number of items (websites) can 
be in the billions. This is why Google's ranking algorithm, PageRank, is an offline 
algorithm that pre-computes ranking results for popular queries and stores them 
in a table. However, this approach is not real-time and may result in outdated or 
missing results. 

Why does the challenge arise? The challenge arises because we are focused on the
ordering of all items, which requires a fundamental lower bound of $\frac{n^2}{2}$ comparisons.


Top-K ranking To tackle this challenge, a straightforward yet effective approach 
can be adopted based on the following observation: in many practical scenarios, the 
focus is only on identifying a small number of significant items, such as the top-K 
ranked items, among a large number of alternatives. A classic example is web search, 
where only the top 20 or 30 relevant websites are of interest, rather than the entire 
list of relevant websites. 

This observation motivates us to shift our focus to top-K ranking, where the goal 
is to identify the top-K ranked items rather than obtaining a complete ordering. 
Clearly, this approach can significantly reduce the number of required comparisons 
for ranking. This gives rise to natural information-theoretic questions. 


• What is the fundamental limit on the number of pairwise comparisons
  required for top-K ranking?
• Is there a computationally efficient algorithm that can achieve the limit?


There exists a fundamental limit, which is far below $\sim n^2/2$. Also, there is an
efficient algorithm that can achieve the limit. We will explore them in depth.
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Figure 3.25. A comparison graph in the BTL model. 


A benchmark model The information-theoretic limit is determined by a math- 
ematical model that measures the accuracy of pairwise comparisons. The Bradley- 
Terry-Luce (BTL) model, which has been widely used as a benchmark model for 
evaluating ranking algorithms, is a prominent example of such a model (Bradley 
and Terry, 1952). It is commonly believed that an algorithm that performs well
under the BTL model will perform well in practice, although it is not always 
the case. 

The BTL model is based on two key assumptions. The first assumption is that
there are ground truth scores, $x := [x_1, x_2, \ldots, x_n]$, that determine the ranking of
the items, with higher scores corresponding to higher rankings. The second assumption
is related to the quality of information available. When the number $n$ of items
is very large, it is not feasible to observe all possible pairs of items, and we can only
access a subset of them. To account for this, we introduce a comparison graph that
shows which pairs have been observed (see Fig. 3.25). In the comparison graph, an
edge indicates that a comparison has been made between the two items it connects.
For a given pair of items $i$ and $j$ that have been observed, we are provided with pairwise
comparison information that indicates whether item $i$ is preferred over item $j$:

$$
y_{ij} = \mathbf{1}\{\text{item } i \succ \text{item } j\}, \quad (i, j) \in \mathcal{E}
$$

where $\mathbf{1}\{\cdot\}$ denotes an indicator function and $\mathcal{E}$ indicates the edge set. This model
assumes that the winning rate is proportional to the relative score of the two associated
items. Hence, the probability of item $i$ winning over item $j$ is $\frac{x_i}{x_i + x_j}$, which
in turn yields:

$$
y_{ij} \sim \text{Bern}\left(\frac{x_i}{x_i + x_j}\right). \tag{3.45}
$$

Figure 3.26. Translation of top-K ranking into a communication problem.


Notice that $y_{ij}$ is noisy data. In an effort to combat the noise effect, this
model allows for repeated independent comparisons, say $L$ repetitions: $y_{ij}^{(\ell)}$ i.i.d.
$\sim$ Bern$\left(\frac{x_i}{x_i + x_j}\right)$ over $\ell \in \{1, 2, \ldots, L\}$.

Translation to a communication problem Interpreting $x$ as a message that
we wish to infer given $\{y_{ij}^{(\ell)}\}$, we can view the ranking problem as an inference
problem. Note that $y_{ij}^{(\ell)}$'s are statistically related to $x$ (see (3.45)). Hence, one can
translate the problem into a communication problem.

We start with $x$ and this is fed into an encoder. See Fig. 3.26. The encoder is
a pairwise information function. Since the observation quality hinges upon the
relative score of two items involved, one can set the function so as to yield:

$$
x_{ij} = \frac{x_i}{x_i + x_j}.
$$

Since we can access only part of the pairwise information in a random manner and
also what we observe is binary information, we can abstract the measurement
process as an erasure channel:

$$
y_{ij} =
\begin{cases}
\text{Bern}(x_{ij}), & \text{w.p. } p; \\
e, & \text{w.p. } 1 - p
\end{cases}
$$

where $p$ indicates the observation probability. Assume that whenever an observation
is made (i.e., $(i, j) \in \mathcal{E}$), $L$ independent copies are given: $y_{ij}^{(\ell)}$ i.i.d. over $\ell \in
\{1, 2, \ldots, L\}$.

Given $\{y_{ij}^{(\ell)}\}$, the goal is to identify a set of top-K ranked items. Since it is a
function of $x$, we denote this by $f(x)$.² We consider a simple setting which aims
at decoding the top-K set. One may want to consider another practically relevant
setting which targets the order of the top-K items as well.
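The measurement model just described can be simulated in a few lines. The sketch below is only an illustration under stated assumptions: the function name `sample_btl`, the example scores, and the choice to store one entry per pair $i < j$ are all hypothetical. It returns, for each observed pair, the number of wins of item $i$ over item $j$ among the $L$ repeated comparisons, which is the sum of the $L$ Bernoulli copies.

```python
import numpy as np

def sample_btl(x, p, L, rng):
    """Draw BTL comparisons for scores x.

    Returns (edges, wins): edges[i, j] is True if pair (i, j) with i < j is
    observed; wins[i, j] counts how many of the L comparisons item i won.
    """
    n = len(x)
    win_prob = x[:, None] / (x[:, None] + x[None, :])   # x_i / (x_i + x_j)
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)   # each pair once
    edges = upper & (rng.random((n, n)) < p)            # observed w.p. p
    wins = rng.binomial(L, win_prob) * edges            # sum of L Bern copies
    return edges, wins

rng = np.random.default_rng(0)
x = np.array([4.0, 2.0, 1.0, 1.0])    # hypothetical ground-truth scores
L = 10000
edges, wins = sample_btl(x, p=1.0, L=L, rng=rng)

# Empirical win rate of item 0 over item 1 versus x_0 / (x_0 + x_1) = 2/3
print(wins[0, 1] / L, x[0] / (x[0] + x[1]))
```

With many repetitions the empirical win rate concentrates around $\frac{x_i}{x_i + x_j}$, which is why larger $L$ helps combat the noise in individual comparisons.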


2. It is in contrast to the previous instances (communication, community detection and Haplotype phasing)
which aim at decoding $x$. In general inference problems, what we wish to decode is a function of $x$, so the
case herein belongs to the general setting.


An optimization problem As we did in the previous instances, we define two
performance metrics. One is the quantity whose limit we are interested in characterizing:
the number of pairwise comparisons, i.e., the sample complexity. Due to the
WLLN, it is concentrated around $\binom{n}{2} pL$ in the limit of $n$. The second is the error
probability defined as $P_e := P(f(\hat{x}) \neq f(x))$. With these two metrics, we can
formulate an optimization problem as we did earlier. Given $(p, L, n)$:

$$
P_e^*(p, L, n) := \min_{\text{algorithm}} P_e.
$$

Similar to the communication and community detection problems, it is not that
easy to derive the probability of error. It has been wide open. To make some
progress, we focus on the asymptotic regime in which $n$ is large. This motivates
us to consider the following optimization. Given $(p, L)$,

$$
P_e^*(p, L) := \min_{\text{algorithm},\, n} P_e.
$$

The distinction is that $n$ serves as a control variable that we optimize over. It turns
out that for some $(p, L)$, we can make $P_e$ arbitrarily close to zero as $n \to \infty$. We say that
such $(p, L)$ is achievable. We are interested in characterizing the minimal achievable
region $\mathcal{R}$ of $(p, L)$.


Minimal achievable region of $(p, L)$ The minimal achievable region depends
highly on one key metric. The key metric is the separation score between the boundary
items (the $K$th and $(K+1)$th ranked items):

$$
\Delta_K := \frac{x_K - x_{K+1}}{x_K}.
$$

Here $\Delta_K$ indicates the normalized separation score.

Our intuition says that the larger the separation score, the easier it is to rank, thus yielding
a smaller achievable $(p, L)$. This is indeed the case. More concretely, it has
been shown that for some positive constants $0 < c_2 < c_1$ (Chen and Suh, 2015):

$$
pL > c_1 \frac{\ln n}{n \Delta_K^2} \implies P_e \to 0;
$$

$$
pL < c_2 \frac{\ln n}{n \Delta_K^2} \implies P_e \not\to 0.
$$

We focus on a feasible regime in which $p > \frac{\ln n}{n}$. Otherwise, the comparison graph
is disconnected (why?), thus ranking becomes impossible. See Fig. 3.27 for an illustration
of the minimal achievable region.
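The role of the regime $p > \frac{\ln n}{n}$ can be observed empirically. The sketch below samples Erdős–Rényi comparison graphs, in which each pair is observed independently with probability $p$, and reports how often they are connected for $p$ well below and well above $\frac{\ln n}{n}$. It is a simulation rather than a proof; the multipliers 0.3 and 3, the size $n = 500$, and the helper names are arbitrary illustrative choices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def is_connected(n, p, rng):
    """Sample a comparison graph with edge probability p and test connectivity."""
    upper = np.triu(rng.random((n, n)) < p, k=1)       # each pair once
    adj = csr_matrix((upper | upper.T).astype(np.int8))
    n_comp, _ = connected_components(adj, directed=False)
    return n_comp == 1

def connected_fraction(n, scale, trials, rng):
    """Fraction of trials in which the graph with p = scale * ln n / n is connected."""
    p = scale * np.log(n) / n
    return np.mean([is_connected(n, p, rng) for _ in range(trials)])

rng = np.random.default_rng(0)
n = 500
below = connected_fraction(n, 0.3, 20, rng)   # p = 0.3 ln n / n
above = connected_fraction(n, 3.0, 20, rng)   # p = 3 ln n / n
print(below, above)
```

Below the threshold the graph typically contains isolated items that are never compared with anything, so their rank cannot be determined from the data.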
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Figure 3.27. Minimal achievable region of (p,L). 


Using the above result, one can characterize the order-wise tight minimal sample
complexity:

$$
\text{(minimal sample complexity)} = \Theta\left(\frac{n \ln n}{\Delta_K^2}\right).
$$

The standard notation $y = \Theta(x)$ means that there exist positive constants $c_1$ and
$c_2$ such that $c_2 x \leq y \leq c_1 x$. This result makes intuitive sense. The larger $\Delta_K$,
the easier it is to rank, thus reducing sample complexity. Also, comparing to the minimal
sample complexity for total ordering,

$$
\text{(minimal sample complexity for total ordering)} = O(n^2),
$$

the result is promising. With top-K ranking, we can reduce the sample complexity
from $\sim n^2$ to $\sim n \ln n$. Here the standard notation $y = O(x)$ means that there
exists a positive constant $c$ such that $y \leq cx$.
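To make the comparison concrete, the following sketch computes the normalized separation $\Delta_K$ for a small hypothetical score vector and compares the two orders of growth, $\frac{n \ln n}{\Delta_K^2}$ versus $\frac{n^2}{2}$, ignoring the unspecified constants. The scores, the choice $K = 3$, and the function names are assumptions for illustration; scores are sorted in decreasing order so that $x_K$ is the $K$th largest.

```python
import numpy as np

def separation(x, K):
    """Normalized separation (x_K - x_{K+1}) / x_K for scores ranked in decreasing order."""
    s = np.sort(np.asarray(x, dtype=float))[::-1]
    return (s[K - 1] - s[K]) / s[K - 1]

def topk_order(n, delta_K):
    """Order of the top-K sample complexity, n ln n / delta_K^2 (constants ignored)."""
    return n * np.log(n) / delta_K**2

def total_order(n):
    """Order of the total-ordering sample complexity, about n^2 / 2."""
    return n**2 / 2

x = [1.0, 0.9, 0.8, 0.4, 0.3]     # hypothetical scores
delta = separation(x, K=3)        # (0.8 - 0.4) / 0.8 = 0.5
n = 10**9
ratio = topk_order(n, delta) / total_order(n)
print(delta, ratio)
```

For a billion items and $\Delta_K = 0.5$, the ratio is below $10^{-6}$, which illustrates how much smaller the top-K requirement is in order terms.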


Look ahead In the next section, we will study an efficient algorithm that achieves
the limit: for some positive constant $c$,

$$
pL > c \frac{\ln n}{n \Delta_K^2} \implies P_e \to 0.
$$
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3.10 Top-K Ranking: An Efficient Algorithm 


Recap In the previous section, we delved into a ranking problem. Our focus was
on top-K ranking, where the objective is to retrieve the top-K ranked items, driven
by numerous practical applications. Using the well-established BTL model (Bradley
and Terry, 1952) as a benchmark, we transformed it into a communication problem,
where the ground-truth scores were represented by the input $x$, and the target
function $f(x)$ was the set of top-K items. Under the BTL model, every pair of
items is observed randomly and uniformly with probability $p$, and each observed
pair has $L$ independent copies, each following Bern$\left(\frac{x_i}{x_i + x_j}\right)$ for items $i$ and $j$. We
asserted that the minimum number of pairwise comparisons required for reliable
top-K ranking, i.e., the minimum sample complexity, is:

$$
\Theta\left(\frac{n \ln n}{\Delta_K^2}\right) \tag{3.46}
$$

where $\Delta_K := \frac{x_K - x_{K+1}}{x_K}$ denotes the normalized score separation between the $K$th
and $(K+1)$th items, reflecting a difficulty level of separating the two boundary
items.


Outline In this section, our focus will be on a computationally efficient algorithm 
that achieves the aforementioned limit (3.46). This section consists of four parts. 
Firstly, we will introduce a performance metric that captures the ranking perfor- 
mance associated with the BTL model. Next, we will discuss one well-known algo- 
rithm that aims to maximize this ranking performance, which is a modified version 
of Google’s PageRank. After that, we will draw attention to a challenge faced by 
this variant. Finally, we will examine a more advanced algorithm that overcomes 
the challenge and achieves the minimum sample complexity stated in (3.46). 


How to estimate scores? In the BTL model, scores correspond to a rank- 
ing such that higher scores imply higher rankings. Therefore, we adopt a two-step 
approach consisting of score estimation followed by ranking based on the estimate. 
The question then becomes how to estimate the scores. To answer this, we first 
need to identify a suitable metric that quantifies the quality of an estimate. One 
commonly used performance metric is the probability of error, which is defined as: 


$$P_e := P(\hat{x} \neq x).$$

This is not a proper metric in our problem setting though. Why? $P_e$ is always 1 no
matter what the decoder is, so it cannot distinguish good decoders from bad ones. This
is because $x$ is continuous-valued, so the success event $\{\hat{x} = x\}$ has measure 0.
In the context of estimation in which one wishes to infer a continuous quantity, there
is a well-known metric: Mean Square Error (MSE) defined as

$$\frac{1}{n} \|\hat{x} - x\|^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{x}_i - x_i)^2.$$

We wish to develop algorithms that minimize the MSE.


A slight variant of PageRank There is one popular algorithm that minimizes
the MSE in the context of ranking problems (Negahban et al., 2012): a slight
variant of PageRank, the backbone of Google's web search engine. We employ
this variant.

Let us explain how the algorithm works. Here are a few key observations that inspire
the algorithm. Remember that when a pair $(i, j)$ is observed, we are given $L$
independent copies: $y_{ij}^{(1)}, \dots, y_{ij}^{(L)}$. From this, what we can compute
is the empirical mean:

$$y_{ij} = \frac{1}{L} \sum_{\ell=1}^{L} y_{ij}^{(\ell)}, \quad (i, j) \in \mathcal{E} \qquad (3.47)$$


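where $\mathcal{E}$ denotes the edge set of the comparison graph. One key observation is: by
the weak law of large numbers (WLLN), the empirical mean converges to its true mean:

$$y_{ij} = \frac{1}{L} \sum_{\ell=1}^{L} y_{ij}^{(\ell)} \xrightarrow{\text{in prob.}} \frac{x_i}{x_i + x_j}. \qquad (3.48)$$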
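The convergence of the empirical mean can be observed numerically. The sketch below draws $L$ Bernoulli comparisons for a single pair and reports the empirical mean for growing $L$; the scores 1.2 and 0.8, the seed and the helper name `empirical_mean` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_i, x_j = 1.2, 0.8              # hypothetical scores for items i and j
true_mean = x_i / (x_i + x_j)    # success probability of each comparison

def empirical_mean(L, rng):
    """Average of L independent Bern(x_i/(x_i+x_j)) comparison outcomes."""
    return rng.binomial(L, true_mean) / L

for L in (10, 100, 10_000, 1_000_000):
    y_ij = empirical_mean(L, rng)
    print(f"L = {L:>9}: y_ij = {y_ij:.4f}, |error| = {abs(y_ij - true_mean):.4f}")
```

The error typically shrinks at the rate $1/\sqrt{L}$, which is why a larger $L$ makes the observed $y_{ij}$ a more faithful proxy for $\frac{x_i}{x_i + x_j}$.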
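Also observe that

$$x_i \cdot \frac{x_j}{x_i + x_j} = x_j \cdot \frac{x_i}{x_i + x_j}. \qquad (3.49)$$

This formula (3.49) reminds us of the detailed balance equation in a Markov chain:

$$\pi_i p_{ji} = \pi_j p_{ij}$$

where $\pi_i := P(\text{state} = i)$ (stationary distribution) and
$p_{ji} := P(\text{state}_{\text{next}} = j \mid \text{state}_{\text{current}} = i)$ (transition
probability). We can view $x_i$ as $\pi_i$ and $\frac{x_j}{x_i + x_j}$ as $p_{ji}$.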


This motivates us to construct a Markov chain in which the transition probability
from $j$ to $i$ takes $y_{ij}$, which converges to $\frac{x_i}{x_i + x_j}$ as $L \to \infty$.
One caveat is that $\sum_i y_{ij}$ can exceed 1 while $\sum_i p_{ij}$ must be 1. To resolve
this, we normalize $y_{ij}$ by the maximum out-degree (maximum number of out-going
edges), denoted by $d_{\max}$:

$$p_{ij} = \frac{y_{ij}}{d_{\max}}.$$
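[Figure: a Markov chain on states $i$ and $j$ with transition probabilities $y_{ij}/d_{\max}$
and $y_{ji}/d_{\max}$ between them, and self-transition probabilities
$1 - \frac{1}{d_{\max}} \sum_{m:(m,i)\in\mathcal{E}} y_{mi}$ and
$1 - \frac{1}{d_{\max}} \sum_{m:(m,j)\in\mathcal{E}} y_{mj}$.]

Figure 3.28. A Markov chain inspired by observations of (3.48) and (3.49).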


Also we add a self-transition to ensure $\sum_i p_{ij} = 1$:

$$p_{jj} = 1 - \frac{1}{d_{\max}} \sum_{m:(m,j)\in\mathcal{E}} y_{mj}.$$
Now by (3.48), one can see that, for large $L$,

$$\pi_i \frac{x_j}{x_i + x_j} \approx \pi_j \frac{x_i}{x_i + x_j}.$$

This together with (3.49) concludes that the stationary distribution $\pi$ converges to
$x$ up to some scaling as $L \to \infty$. This series of observations leads to the following
natural idea:

$$\text{Take } \hat{x} = \pi.$$
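As a sanity check of this idea, the sketch below builds the transition matrix from the limiting observations $y_{ij} = \frac{x_i}{x_i + x_j}$ on a complete comparison graph and verifies that the normalized score vector is stationary. The score values and the complete-graph assumption are illustrative choices, not part of the text.

```python
import numpy as np

x = np.array([1.2, 1.0, 0.9, 0.7])   # hypothetical ground-truth scores
n = len(x)
d_max = n - 1                         # complete comparison graph (assumption)

# Limiting observations: Y[i, j] = x_i / (x_i + x_j) for i != j
Y = x[:, None] / (x[:, None] + x[None, :])
np.fill_diagonal(Y, 0.0)

# P[i, j] is the transition probability from j to i
P = Y / d_max
np.fill_diagonal(P, 1.0 - P.sum(axis=0))   # self-transitions: columns sum to 1

pi = x / x.sum()                      # candidate stationary distribution
print(np.allclose(P.sum(axis=0), 1.0))     # P is a valid transition matrix
print(np.allclose(P @ pi, pi))             # global balance: P pi = pi
```

Because detailed balance holds exactly in this noiseless limit, the normalization by $d_{\max}$ cancels on both sides and does not affect the stationary distribution.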


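Then, how to compute $\pi$? To gain some insight, consider the following equation
(called the global balance equation in the Markov chain literature):

$$\sum_j p_{ij} \pi_j = \pi_i.$$

An alternative matrix representation of this is:

$$P \pi = \pi$$

where $P$ denotes the transition probability matrix:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{bmatrix}.$$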
From this, we see that $\pi$ is an eigenvector of $P$. It is well-known in the Markov
chain literature that the stationary distribution $\pi$ is the first eigenvector (with the
largest eigenvalue, which equals 1) of the transition probability matrix. Remember that
one very efficient and useful way of computing the first eigenvector is the power method.
See Section 3.4.


Additional step: Local refinement The PageRank variant taking $\hat{x} = \pi$
exhibits great MSE performance (Negahban et al., 2012). Is this algorithm enough
then? We are interested in a ranking, not the scores $x$ themselves. So one may ask:
Does the MMSE solution imply high ranking accuracy? It turns out that this is not
necessarily the case. We can readily see this from a scenario in which many of the
estimates are very close to the ground-truth scores while only a very few of the
estimates (let's call them outliers) are far from the ground truth. In this case, the MSE
can be very small while the ranking accuracy is poor due to the outliers. See such an
example in Fig. 3.29. Notice that many of the estimates are close to the ground-truth
scores, leading to a small MSE; however, the ranking result (top-3) is distinct from the
ground-truth top-3. From this observation, we see that coordinate-wise errors are
required to be small enough in order to ensure high ranking accuracy.
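The gap between a small MSE and a correct top-K set can be reproduced with a few lines of code. The sketch below uses made-up scores for eight items and a single overestimated item; all numbers and the helper name `top_k` are hypothetical and chosen only to mimic the situation described above.

```python
import numpy as np

def top_k(scores, k):
    """Set of (1-based) indices of the k largest scores."""
    return set((np.argsort(-scores)[:k] + 1).tolist())

# Hypothetical ground truth for 8 items, sorted in decreasing order
x = np.array([1.00, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40])
# Estimate: tiny errors everywhere except item 4, which is overestimated
x_hat = x + 0.005
x_hat[3] = 0.93                     # the outlier

mse = np.mean((x_hat - x) ** 2)
print("MSE:", mse)
print("ground-truth top-3:", top_k(x, 3))
print("estimated top-3:  ", top_k(x_hat, 3))
```

A single coordinate with a large error is enough to swap items 3 and 4 across the top-3 boundary, even though the average squared error stays small.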

[Figure: per-item scores, ground truth versus estimate, with one estimate marked as an
outlier; the two top-3 sets shown are {1, 2, 3} and {1, 2, 4}.]

Figure 3.29. An example in which the MSE of an estimate is small while its ranking result
(top-3) is distinct from the ground-truth top-3.

Remember the advanced algorithm for Haplotype phasing in Section 3.8. The
advanced algorithm employs an additional step (called local refinement) in an effort
to detect any coordinate-wise error. This motivates us to apply the same method
herein. The role of the second step in this problem setup is to detect outliers and
then control the corresponding errors in a point-wise manner. To implement this, we
use the coordinate-wise maximum likelihood estimator (MLE), which tries to minimize
the coordinate-wise MSE. Here is how it works. Pick an item, say item $i$. We
then compute the coordinate-wise MLE w.r.t. item $i$. Similar to Haplotype phasing,
computing the coordinate-wise MLE poses a challenge: it necessitates knowledge of the
ground-truth scores of the other items, which are not accessible. Therefore, we adopt
the same approach as before and utilize the estimate obtained in the previous step.
Specifically, we compute:


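$$\hat{x}_i^{\text{MLE}} = \arg\max_{a} \mathcal{L}(\hat{x}_1, \dots, \hat{x}_{i-1}, a, \hat{x}_{i+1}, \dots, \hat{x}_n)$$

where $\hat{x}_j$ denotes the estimate obtained in the previous stage; and $\mathcal{L}(\cdot)$
indicates the likelihood function:

$$\mathcal{L}(\hat{x}_1, \dots, \hat{x}_{i-1}, a, \hat{x}_{i+1}, \dots, \hat{x}_n)
= \prod_{j:(i,j)\in\mathcal{E}} \prod_{\ell=1}^{L} P\left(y_{ij}^{(\ell)} \mid \hat{x}_1, \dots, a, \dots, \hat{x}_n\right)$$

$$= \prod_{j:(i,j)\in\mathcal{E}} \prod_{\ell=1}^{L} \left(\frac{a}{a + \hat{x}_j}\right)^{y_{ij}^{(\ell)}} \left(\frac{\hat{x}_j}{a + \hat{x}_j}\right)^{1 - y_{ij}^{(\ell)}}$$

$$= \prod_{j:(i,j)\in\mathcal{E}} \left(\frac{a}{a + \hat{x}_j}\right)^{L y_{ij}} \left(\frac{\hat{x}_j}{a + \hat{x}_j}\right)^{L(1 - y_{ij})}$$

where the last equality is due to the definition of $y_{ij} := \frac{1}{L} \sum_{\ell=1}^{L} y_{ij}^{(\ell)}$ (see (3.47)).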
We then compute the gap between the coordinate-wise MLE and its earlier
estimate:

$$d := \left|\hat{x}_i^{\text{MLE}} - \hat{x}_i\right|.$$

We declare $\hat{x}_i$ to be an outlier if $d$ exceeds a carefully chosen threshold, and
then replace $\hat{x}_i$ with $\hat{x}_i^{\text{MLE}}$ in an effort to control the corresponding
error. Otherwise, we keep $\hat{x}_i$ as it is. We repeat the same procedure for the other
items, one by one. This forms one iteration. We do multiple iterations.


The formula of the threshold is given in (Chen and Suh, 2015):

$$\delta^{(t)} = c \left\{ \sqrt{\frac{\ln n}{npL}} + \frac{1}{2^t} \left( \sqrt{\frac{\ln n}{pL}} - \sqrt{\frac{\ln n}{npL}} \right) \right\} \qquad (3.51)$$

where $c$ indicates some positive constant and $t$ refers to the iteration index:
$0 \le t \le T_{\text{iter}}$. The rationale behind this choice is twofold. First, in the initial
stage $t = 0$, $\sqrt{\frac{\ln n}{pL}}$ can be shown to be order-wise greater than the maximum
gap between $\hat{x}_i^{\text{MLE}}$ and $x_i$:

$$\max_{1 \le i \le n} \left|\hat{x}_i^{\text{MLE}} - x_i\right| \le c_1 \sqrt{\frac{\ln n}{pL}} \qquad (3.52)$$

for some positive constant $c_1$. Second, at the end of the last stage, $\sqrt{\frac{\ln n}{npL}}$
can be proven to be an upper bound:

$$\max_{1 \le i \le n} \left|\hat{x}_i^{\text{MLE},(T_{\text{iter}})} - x_i\right| \le c_2 \sqrt{\frac{\ln n}{npL}} \qquad (3.53)$$

for some positive constant $c_2$. Here $\hat{x}_i^{\text{MLE},(t)}$ refers to the coordinate-wise
MLE of the $i$th item at the $t$-th iteration. As the proofs of these bounds are beyond the
scope of this book, we omit them.
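The threshold (3.51) interpolates between the two levels appearing in the bounds: it starts at $c\sqrt{\frac{\ln n}{pL}}$ for $t = 0$ and halves its distance to $c\sqrt{\frac{\ln n}{npL}}$ at every iteration. A minimal sketch, assuming $c = 1$ and illustrative values of $n$, $p$ and $L$ (the function name `threshold` is hypothetical):

```python
import numpy as np

def threshold(t, n, p, L, c=1.0):
    """Threshold delta^(t) from (3.51); c is an unspecified positive constant."""
    fine = np.sqrt(np.log(n) / (n * p * L))     # level targeted at the last stage
    coarse = np.sqrt(np.log(n) / (p * L))       # level used at the initial stage
    return c * (fine + (coarse - fine) / 2.0 ** t)

n, p, L = 1000, 0.05, 20                         # illustrative values
for t in range(8):
    print(t, threshold(t, n, p, L))
```

With this geometric schedule, roughly $\log_2 \sqrt{n}$ iterations bring the threshold within a constant factor of its final level, which is in line with choosing the number of iterations on the order of $\ln n$.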


Minimum sample complexity It has been shown in (Chen and Suh, 2015)
that the PageRank variant together with local refinement yields: if the number of
iterations is around $\ln n$, then for some positive constant $c$,

$$\binom{n}{2} pL \ge c \frac{n \ln n}{\Delta_K^2} \implies P_e \to 0.$$

This completes the achievability proof. In fact, the analysis for the above is not that
simple. It requires a variety of techniques including: (i) non-trivial linear-algebra
tricks; (ii) Chernoff-like bounds; (iii) a number of inequalities such as Cauchy-Schwarz,
triangle, Pinsker's, etc. We omit the details. We also omit the proof of the other
direction: for some other constant $c' > 0$,

$$P_e \to 0 \implies \binom{n}{2} pL \ge c' \frac{n \ln n}{\Delta_K^2}.$$

To learn more, refer to (Chen and Suh, 2015).


Look ahead We have explored the third application of data science, namely top- 
K ranking, and examined the minimum sample complexity required for reliable 
top-K ranking. Additionally, we investigated an efficient algorithm that achieves 
the limit with a constant factor gap. In the subsequent section, we will implement 
this efficient algorithm using Python. 
3.11 Top-K Ranking: Python Implementation


Recap In the previous sections, we explored the problem of top-K ranking, which
is another inference problem concerning phase transition. We examined an efficient
algorithm that achieves the minimal sample complexity, up to a constant factor gap:

$$S := \binom{n}{2} pL \ge c \frac{n \ln n}{\Delta_K^2} \implies P_e \to 0 \qquad (3.54)$$

where $c$ is some positive constant and $\Delta_K := \frac{x_K - x_{K+1}}{x_K}$ denotes the
normalized score separation between the Kth and (K + 1)th items.


Outline In this section, we will delve into the algorithm by implementing it in 
Python. This section consists of four parts. First, we will provide a pseudocode 
of the algorithm that outlines all the detailed procedures. Second, we will focus 
on implementing the first stage of the algorithm, which is the PageRank variant. 
This variant finds the first eigenvector of the transition probability matrix in the 
relevant Markov chain. We will also evaluate the performance of the algorithm by 
calculating the mean squared error (MSE) between the ground truth score vector 
$x$ and its estimate $\hat{x}$. Third, we will incorporate the second stage of the algorithm
(local refinement) to implement the advanced version. Finally, we will compare the 
MSE performance of the two algorithms. 


Pseudocode of the efficient algorithm The efficient algorithm consists of
two stages: (i) obtaining an initial estimate of $x$ via the spectral algorithm (finding
the first eigenvector of the transition probability matrix of a Markov chain); and
(ii) performing local refinement of the initial estimate via the coordinate-wise MLE.
Details of the first stage are given below.

1. Given $y_{ij}^{(\ell)}$'s and $\mathcal{E}$, compute $y_{ij}$ for all $(i, j) \in \mathcal{E}$:
$y_{ij} = \frac{1}{L} \sum_{\ell=1}^{L} y_{ij}^{(\ell)}$.

2. Compute the transition matrix $P = [p_{ij}]$: $p_{ij} = \frac{y_{ij}}{d_{\max}}$ if
$(i, j) \in \mathcal{E}$, and $p_{ij} = 0$ otherwise. Here
$d_{\max} = \max_{1 \le j \le n} |\{(i, j) : (i, j) \in \mathcal{E} \ \forall i\}|$. For the $(j, j)$ entry,

$$p_{jj} = 1 - \frac{1}{d_{\max}} \sum_{m:(m,j)\in\mathcal{E}} y_{mj}.$$

3. Compute the stationary distribution $\pi$ to obtain $\hat{x}^{(0)} = n\pi$.

In Step 3, we multiply $\pi$ by $n$ for a proper scaling. This way, an estimated score
$\hat{x}_i^{(0)}$ does not scale with $n$. The second stage for local refinement is described
as follows.
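4. For $t \in \{0, \dots, T_{\text{iter}} - 1\}$, iterate the following: for each $1 \le i \le n$,

$$\hat{x}_i^{\text{MLE}} = \arg\max_{a} \ln \mathcal{L}(\hat{x}_1^{(t)}, \dots, \hat{x}_{i-1}^{(t)}, a, \hat{x}_{i+1}^{(t)}, \dots, \hat{x}_n^{(t)});$$

$$\hat{x}_i^{(t+1)} = \begin{cases} \hat{x}_i^{\text{MLE}}, & |\hat{x}_i^{\text{MLE}} - \hat{x}_i^{(t)}| > \delta^{(t)}; \\ \hat{x}_i^{(t)}, & |\hat{x}_i^{\text{MLE}} - \hat{x}_i^{(t)}| \le \delta^{(t)}. \end{cases}$$

Here $\ln \mathcal{L}(\cdot)$ indicates the log-likelihood function:

$$\ln \mathcal{L}(\hat{x}_1^{(t)}, \dots, \hat{x}_{i-1}^{(t)}, a, \hat{x}_{i+1}^{(t)}, \dots, \hat{x}_n^{(t)})
= L \sum_{j:(i,j)\in\mathcal{E}} \left\{ y_{ij} \ln \frac{a}{a + \hat{x}_j^{(t)}} + (1 - y_{ij}) \ln \frac{\hat{x}_j^{(t)}}{a + \hat{x}_j^{(t)}} \right\}$$

$$= L \sum_{j:(i,j)\in\mathcal{E}} (1 - y_{ij}) \ln \hat{x}_j^{(t)} + L \sum_{j:(i,j)\in\mathcal{E}} \left\{ y_{ij} \ln a - \ln(a + \hat{x}_j^{(t)}) \right\},$$

and $\delta^{(t)}$ denotes the threshold for comparison:

$$\delta^{(t)} = c \left\{ \sqrt{\frac{\ln n}{npL}} + \frac{1}{2^t} \left( \sqrt{\frac{\ln n}{pL}} - \sqrt{\frac{\ln n}{npL}} \right) \right\}$$

for some positive constant $c$.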


Python implementation of the PageRank variant We implement the
PageRank variant (corresponding to Steps 1, 2 and 3 in the above). For illustrative
purposes, we first consider a simple setting where the parameters of interest are small
numbers.

import numpy as np

n=4
K=2 # top-K ranking
p=2*np.log(n)/n # for graph connectivity
L=100

We set $p = \frac{2 \ln n}{n}$ to ensure graph connectivity ($p > \frac{\ln n}{n}$). The
parameter $L$ is chosen as 100. We generate the ground truth score vector $x$. To emphasize
the separation score between the Kth and (K + 1)th scores, we set $x_K = 0.8$ and
$x_{K+1} = 0.8 - \text{gap}$ such that $\Delta_K$ is fixed as $\frac{\text{gap}}{0.8}$. We choose
gap = 0.1 to have $\Delta_K = 0.125$. On the other hand, we generate $(x_1, \dots, x_{K-1})$ and
$(x_{K+2}, \dots, x_n)$ uniformly distributed over $[0.8, 1]$ and $[0.5, 0.8 - \text{gap}]$,
respectively.
# Generate the ground truth score vector
def scoreVector(gap,n,K):
    # Generate top-K items btw 0.8 and 1
    xt=np.random.uniform(0.8,1,K)
    xt=sorted(xt,reverse=True)
    xt[K-1]=0.8
    # Generate n-K bottom items btw 0.5 and 0.8-gap
    xb=np.random.uniform(0.5,0.8-gap,n-K)
    xb=sorted(xb,reverse=True)
    xb[0]=0.8-gap
    x=np.concatenate((xt,xb))
    # score scaling
    x=x/sum(x)*n
    # Compute DeltaK
    DeltaK=(x[K-1]-x[K])/x[K-1]
    return x,DeltaK

gap=0.1
x,DeltaK=scoreVector(gap,n,K)
# sample complexity
S=n*(n-1)/2*p*L
# claimed limit
limit=n*np.log(n)/(DeltaK**2)
print(x)
print(DeltaK)
print(S)
print(limit)

[1.1848639 1.04924926 0.9180931 0.84779374]
0.12500000000000003
415.88830833596717
354.8913564466918

In this setup, the sample complexity is above the claimed limit. 
Next, we construct a comparison graph. 


from scipy.stats import bernoulli

def genGraph(n,p):
    # Generate a comparison graph
    Bern = bernoulli(p)
    G = Bern.rvs((n,n))
    # make G symmetric
    for i in range(n):
        for j in range(i,n): G[j,i]=G[i,j]
    for i in range(n): G[i,i]=0
    # compute dmax=max_j |{(i,j):(i,j) in G forall i}|
    dmax=max(sum(G))
    return G,dmax

G,dmax=genGraph(n,p)
print(G)
print(dmax)

[[0 1 1 1]
 [1 0 1 1]
 [1 1 0 1]
 [1 1 1 0]]
3


Using the comparison graph, we generate observations $y_{ij}^{(\ell)}$'s and then compute
$y_{ij}$ and $p_{ij}$ as per Steps 1 and 2 in the pseudocode.

from numpy.random import binomial

def transitionMatrix(x,n,L,G,dmax):
    # initialization
    Y=np.zeros((n,n))
    P=np.zeros((n,n))
    # construct transition matrix
    for i in range(n):
        for j in range(i,n):
            if G[i,j]==1:
                Y[i,j]=binomial(L,x[i]/(x[i]+x[j]))/L
                Y[j,i]=1-Y[i,j]
                P[i,j]=Y[i,j]/dmax
                P[j,i]=Y[j,i]/dmax
    # add self-transition (each column of P sums to 1)
    for i in range(n): P[i,i]=1-sum(P[:,i])
    return P,Y

P,Y=transitionMatrix(x,n,L,G,dmax)
print(Y)
print(P)

[[0.   0.51 0.58 0.58]
 [0.49 0.   0.55 0.63]
 [0.42 0.45 0.   0.51]
 [0.42 0.37 0.49 0.  ]]

[[0.55666667 0.17       0.19333333 0.19333333]
 [0.16333333 0.55666667 0.18333333 0.21      ]
 [0.14       0.15       0.46       0.17      ]
 [0.14       0.12333333 0.16333333 0.42666667]]


In order to find the first eigenvector of P, we employ the power method (see
Section 3.4). Below we copy the Python code of the power method.

def power_method(A, eps=1e-5):
    # A computationally efficient algorithm
    # for finding the principal eigenvector
    # Choose a random vector (dimension taken from A)
    v = np.random.randn(A.shape[0])
    # normalization
    v = v/np.linalg.norm(v)

    prev_v = np.zeros(len(v))
    t=0
    while np.linalg.norm(prev_v-v) > eps:
        prev_v = v
        v = np.array(np.dot(A,v)).reshape(-1)
        v = v/np.linalg.norm(v)
        t+=1
    return v

x0=power_method(P)
# scaling
x0=x0/sum(x0)*n
print(x0)
print(x)

[1.17192056 1.1662718 0.87537578 0.78643185]
[1.1848639 1.04924926 0.9180931 0.84779374]


We see that the estimate $\hat{x}^{(0)}$ has the same ordering as the ground truth $x$.
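Since the end goal is the top-K set rather than the scores, one may also compare the two top-K sets directly. A minimal sketch, with the two vectors copied from the printed output above; the helper name `top_k_set` is an assumption introduced here and is not part of the listing.

```python
import numpy as np

def top_k_set(scores, K):
    """Indices (0-based) of the K largest entries of `scores`."""
    return set(np.argsort(scores)[::-1][:K].tolist())

# Values copied from the printed output above
x_true = np.array([1.1848639, 1.04924926, 0.9180931, 0.84779374])
x_init = np.array([1.17192056, 1.1662718, 0.87537578, 0.78643185])

K = 2
print(top_k_set(x_true, K) == top_k_set(x_init, K))
```

Averaging this indicator over many random trials would give an empirical estimate of $1 - P_e$, the quantity that the sample complexity results are actually about.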
Now we perform an extensive experiment for a large value of $n$, say $n = 1000$,
and for a range of $p$ spanning the claimed limit
$p^* = \frac{n \ln n / \Delta_K^2}{\binom{n}{2} L}$.


import numpy as np
from scipy.stats import bernoulli
from numpy.random import binomial
import matplotlib.pyplot as plt

n=1000
K=5 # top-K ranking
L=20
p_range=np.linspace(0.02,0.12,30)
#ln(n)/n ~0.007
#n*ln(n)/(DeltaK**2)/(n*(n-1)*L/2) ~0.044
gap=0.1
# generate the ground truth score vector
x,DeltaK=scoreVector(gap,n,K)
MSE = np.zeros_like(p_range)
ITER=20
for idx,p in enumerate(p_range):
    for k in range(ITER):
        # Generate a comparison graph
        G,dmax=genGraph(n,p)
        # Transition probability matrix
        P,Y=transitionMatrix(x,n,L,G,dmax)
        # Compute the stationary distribution
        x0=power_method(P)
        # score scaling
        x0=x0/sum(x0)*n
        # Compute MSE between x and x0
        MSE[idx]+= \
            sum(np.square(x-x0))/sum(np.square(x))/ITER

plimit=n*np.log(n)/(DeltaK**2)/(n*(n-1)*L/2)
p_norm=p_range/plimit

plt.figure(figsize=(5,5), dpi=100)
plt.plot(p_norm,MSE,label='PageRank variant')
plt.yscale('log')
plt.title('Normalized MSE')
plt.legend()
plt.grid(linestyle=':', linewidth=0.5)
plt.show()


Fig. 3.30 demonstrates the MSE performance of the PageRank variant as a function
of $p$. The range of $p$ is set so as to exceed the graph connectivity threshold
$\frac{\ln n}{n}$ as well as to span the claimed limit: $0.4 < \frac{p}{p^*} < 2.7$. Notice
that the MSE decreases with an increase in $p$.


Additional stage: Local refinement We investigate a more advanced algorithm
that additionally employs local refinement (Step 4 in the above pseudocode).
To implement this, we need to solve the optimization problem taking the
log-likelihood as an objective function:

$$\hat{x}_i^{\text{MLE}} = \arg\max_{a} \ln \mathcal{L}_i(a)$$

where we use the simpler notation:

$$\mathcal{L}_i(a) := \mathcal{L}(\hat{x}_1^{(t)}, \dots, \hat{x}_{i-1}^{(t)}, a, \hat{x}_{i+1}^{(t)}, \dots, \hat{x}_n^{(t)}).$$

One key observation is that the objective function is concave in the optimization
variable $a$. Check this in Prob 9.4. Hence, it is a convex optimization problem. As
mentioned in Section 1.5 and Prob 2.4, one can solve convex optimization via
the Lagrange multiplier method. Sometimes, the method yields a closed-form
solution. But this is not always the case.
Figure 3.30. The MSE performance of the PageRank variant as a function of observation
probability $p$: $n = 1000$, $L = 20$, $K = 5$ and $\Delta_K = 0.125$.

[Figure: the $k$-th estimate $a^{(k)}$ and the update $a^{(k+1)} \leftarrow a^{(k)} + \alpha^{(k)} \nabla f(a^{(k)})$.]

Figure 3.31. How gradient ascent works.


Gradient ascent Our problem has no closed-form solution. Hence, we employ
an algorithm that allows us to obtain the solution numerically. One prominent
algorithm that yields the numerical solution is gradient ascent. Simply put, it is an
algorithm that finds the unique stationary point when the function of interest is
concave. It is called gradient descent when the objective function is convex. Here is
how the algorithm works. See Fig. 3.31. It is an iterative algorithm. Suppose that at
the $k$th iteration, we have an estimate of $a^*$, say $a^{(k)}$. We then compute the gradient
of the function evaluated at the estimate: $\nabla f(a^{(k)})$. Next we update the estimate
along the direction of the gradient:

$$a^{(k+1)} \leftarrow a^{(k)} + \alpha^{(k)} \nabla f(a^{(k)}) \qquad (3.55)$$

where $\alpha^{(k)} > 0$ indicates the learning rate (also called a step size) that usually
decays as $k$ grows. If you think about it, this update rule makes intuitive sense. Suppose
$a^{(k)}$ is placed left relative to the optimal point $a^*$, as in the two-dimensional
case³ illustrated in Fig. 3.31. Then, we should move $a^{(k)}$ to the right so that it
becomes closer to $a^*$. The update rule actually does this, as we add $\alpha^{(k)} \nabla f(a^{(k)})$.
Notice that $\nabla f(a^{(k)})$ points to the right given that $a^{(k)}$ is placed left
relative to $a^*$. We repeat this procedure until it converges. It turns out: as $k \to \infty$,
it converges:

$$a^{(k)} \to a^*, \qquad (3.56)$$

as long as the learning rate is chosen properly, like the one decaying exponentially.
We will not touch upon the proof of this convergence. In fact, the proof is difficult.
There is a big field in statistics which intends to prove the convergence of a variety
of algorithms (if the convergence holds).
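The update rule (3.55) fits in a few lines of code. The sketch below applies it to a toy concave function whose maximizer is known; the function name `gradient_ascent`, the toy objective and the step-size schedule $\alpha^{(k)} = \alpha_0 / \sqrt{k+1}$ are assumptions made for illustration, not choices taken from the text.

```python
import math

def gradient_ascent(grad, a0, num_iters=200, alpha0=0.3):
    """Maximize a concave 1-D function given its derivative `grad`.

    The step size alpha0 / sqrt(k + 1) is one common decaying schedule,
    used here for illustration only.
    """
    a = a0
    for k in range(num_iters):
        a = a + alpha0 / math.sqrt(k + 1) * grad(a)
    return a

# Toy concave objective f(a) = -(a - 2)^2 with derivative f'(a) = -2 (a - 2)
a_star = gradient_ascent(lambda a: -2.0 * (a - 2.0), a0=0.0)
print(a_star)
```

Starting to the left of the maximizer, the derivative is positive and every update moves the estimate to the right, as described above; the estimate approaches the maximizer $a^* = 2$.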


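Python implementation of local refinement We apply gradient ascent to
implement the coordinate-wise MLE. To this end, we compute the gradient of the
objective function:

$$\frac{d}{da} \left\{ \frac{1}{L} \ln \mathcal{L}_i(a) \right\} = \sum_{j:(i,j)\in\mathcal{E}} \left( \frac{y_{ij}}{a} - \frac{1}{a + \hat{x}_j^{(t)}} \right).$$

Then, the update rule for $a^{(k)}$ reads:

$$a^{(k+1)} \leftarrow a^{(k)} + \alpha^{(k)} \sum_{j:(i,j)\in\mathcal{E}} \left( \frac{y_{ij}}{a^{(k)}} - \frac{1}{a^{(k)} + \hat{x}_j^{(t)}} \right). \qquad (3.57)$$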

See below for a code implementation of the coordinate-wise MLE. We first set up 
parameters and run the PageRank variant to obtain an initial estimate. 


3. In a higher-dimensional case, it is difficult to visualize how $a^{(k)}$ is placed. Hence, we focus on the
two-dimensional case. It turns out that gradient ascent works even for high-dimensional settings although it is
not 100% intuitive.
import numpy as np

n=4
K=2 # top-K ranking
p=2*np.log(n)/n # for graph connectivity
L=100
gap=0.1

# generate the ground truth score vector
x,DeltaK=scoreVector(gap,n,K)
# Generate a comparison graph
G,dmax=genGraph(n,p)
# Transition probability matrix
P,Y=transitionMatrix(x,n,L,G,dmax)
# Compute the stationary distribution
x0=power_method(P)
# score scaling
x0=x0/sum(x0)*n
print(x0)

[1.22440733 0.90509149 1.06241023 0.80809095]


def coordinateMLE(x0,G,Y,xmin,xmax):
    n=len(G[0])
    alpha=0.01
    nITER=10

    # initialization for xMLE
    xMLE=np.zeros_like(x0)

    for i in range(len(G[:,0])):
        # compute the ith coordinate MLE
        a=x0[i] # initialization
        for k in range(nITER):
            # compute the gradient of ln(L_i(a))/L
            grad=0 # initialization
            for j in range(len(G[i])):
                if G[i,j]==1: grad+=Y[i,j]/a-1/(a+x0[j])
            # update a^{(k)} (sign of the gradient only)
            a = a+alpha*np.sign(grad)
        # projection to [x_min,x_max]
        if a<xmin: xMLE[i]=xmin
        elif a>xmax: xMLE[i]=xmax
        else: xMLE[i]=a

    return xMLE

xMLE=coordinateMLE(x0,G,Y,min(x),max(x))
print(x0)
print(xMLE)
print(x0-xMLE)

[1.22440733 0.90509149 1.06241023 0.80809095]
[1.20499993 0.88509149 1.06241023 0.82809095]
[ 0.0194074  0.02       0.        -0.02     ]


We employ a simplified version of gradient ascent where we take only the sign
of the gradient and set $\alpha = 0.01$. This way, we can do an efficient update. Otherwise,
the range of the gradient scales with $n$, yielding an unstable update. We also
apply the projection of the MLE solution so that the estimate is within $[x_{\min}, x_{\max}]$.
Notice in the above that the coordinate-wise MLE is slightly different from the initial
estimate.

Next, we employ $\delta^{(t)}$ to decide whether to take the MLE solution:

$$\hat{x}_i^{(t+1)} = \begin{cases} \hat{x}_i^{\text{MLE}}, & |\hat{x}_i^{\text{MLE}} - \hat{x}_i^{(t)}| > \delta^{(t)}; \\ \hat{x}_i^{(t)}, & |\hat{x}_i^{\text{MLE}} - \hat{x}_i^{(t)}| \le \delta^{(t)} \end{cases} \qquad (3.58)$$

where

$$\delta^{(t)} = c \left\{ \sqrt{\frac{\ln n}{npL}} + \frac{1}{2^t} \left( \sqrt{\frac{\ln n}{pL}} - \sqrt{\frac{\ln n}{npL}} \right) \right\}.$$

We set the hyperparameters $c = 0.1$ and $T_{\text{iter}} = 7$. In this case, the range of $\delta^{(t)}$ is:

Titer=7
t=np.arange(Titer)
c=0.1
delta= c*(np.sqrt(np.log(n)/(n*p*L)) \
    + 1/(2**t)*(np.sqrt(np.log(n)/(p*L)) \
    - np.sqrt(np.log(n)/(n*p*L))))
print(delta)

[0.01414214 0.0106066 0.00883883 0.00795495 0.00751301
 0.00729204 0.00718155]

Under this setting, we iterate the coordinate-wise MLE $T_{\text{iter}}$ times.


def iter_coordinateMLE(x0,G,Y,p,L,Titer,xmin,xmax):
    t=np.arange(Titer)
    n=len(G[0])
    c=0.01
    delta= c*(np.sqrt(np.log(n)/(n*p*L)) \
        + 1/(2**t)*(np.sqrt(np.log(n)/(p*L)) \
        -np.sqrt(np.log(n)/(n*p*L))))
    xt=x0
    for it in range(Titer):
        xt1=np.zeros_like(xt)
        xMLE=coordinateMLE(xt,G,Y,xmin,xmax)
        for i in range(n):
            # replace the estimate only if the gap exceeds the threshold
            if np.abs(xMLE[i]-xt[i])>delta[it]:
                xt1[i]=xMLE[i]
            else: xt1[i]=xt[i]
        xt=xt1

    return xt

x_est=iter_coordinateMLE(x0,G,Y,p,L,7,min(x),max(x))

print(x0)
print(x_est)
print(x0-x_est)

[1.22440733 0.90509149 1.06241023 0.80809095]
[1.20499993 0.88509149 1.04241023 0.80809095]
[0.0194074 0.02      0.02      0.       ]


We see that the MLE solution is further away from the initial estimate. 
Now we compare the performance of the PageRank variant and the advanced 
algorithm for a setting where n = 1000 and p spans the claimed limit. 


import numpy as np
from scipy.stats import bernoulli
from numpy.random import binomial
import matplotlib.pyplot as plt

n=1000
K=5 # top-K ranking
L=20
Titer=7
p_range=np.linspace(0.02,0.12,30)
#ln(n)/n ~0.007
#n*ln(n)/(DeltaK**2)/(n*(n-1)*L/2) ~0.044

gap=0.1
# generate the ground truth score vector
x,DeltaK=scoreVector(gap,n,K)

MSE1 = np.zeros_like(p_range) # for pagerank
MSE2 = np.zeros_like(p_range) # for advanced algorithm
ITER=20
for idx,p in enumerate(p_range):
    for k in range(ITER):
        # Generate a comparison graph
        G,dmax=genGraph(n,p)
        # Transition probability matrix
        P,Y=transitionMatrix(x,n,L,G,dmax)
        # Compute the stationary distribution
        x0=power_method(P)
        # score scaling
        x0=x0/sum(x0)*n

        xest= \
            iter_coordinateMLE(x0,G,Y,p,L,Titer,min(x),max(x))

        # Compute MSE for pagerank and advanced algorithm
        MSE1[idx] += \
            sum(np.square(x-x0))/sum(np.square(x))/ITER
        MSE2[idx] += \
            sum(np.square(x-xest))/sum(np.square(x))/ITER

plimit=n*np.log(n)/(DeltaK**2)/(n*(n-1)*L/2)
p_norm=p_range/plimit

plt.figure(figsize=(5,5), dpi=100)
plt.plot(p_norm,MSE1,label='PageRank variant')
plt.plot(p_norm,MSE2,label='Advanced algorithm')
plt.yscale('log')
plt.title('Normalized MSE')
plt.legend()
plt.grid(linestyle=':', linewidth=0.5)
plt.show()


Fig. 3.32 shows the MSE performances of the PageRank variant and the 
advanced algorithm. We see an improvement with local refinement. 
Figure 3.32. The MSE performances of the PageRank variant and the advanced algorithm
as a function of observation probability $p$: $n = 1000$, $L = 20$, $K = 5$ and $\Delta_K = 0.125$.


Look ahead In the previous sections, we discussed the application of top-K rank- 
ing and studied an efficient algorithm that achieves the minimum sample complex- 
ity with some constant factor gap. Additionally, we implemented the algorithm 
using Python. As noted in the beginning of Part III, information-theoretic con- 
cepts like KL divergence and mutual information play a significant role in machine 
learning. In the next section, we will explore one such application. 
Problem Set 9


Prob 9.1 (Ranking from pairwise comparisons) Suppose there exists a
ground-truth ranking, say $R$, of $n$ items: e.g., $R = \{1 > 2 > \cdots > n\}$. We
wish to identify the ranking from pairwise comparisons, e.g., item $i$ is preferred
over item $j$. Let $x_{ij}$ be a pairwise comparison between items $i$ and $j$:

$$x_{ij} = \mathbf{1}\{\text{item } i > \text{item } j\}.$$

Suppose pairwise comparisons are observed uniformly at random. Each comparison
for items $(i, j)$ is observed with probability $p \in [0, 1]$ independently over all $(i, j)$'s:

$$y_{ij} = \begin{cases} x_{ij}, & \text{w.p. } p; \\ e, & \text{w.p. } 1 - p. \end{cases}$$

Given $y_{ij}$'s, we wish to decode $R$. Let $\hat{R}$ be a decoded ranking. Let $P_e$ be the
probability of error: $P_e := P(\hat{R} \neq R)$. Consider an optimization problem. Given $p$,

$$P_e^*(p) := \min_{\text{algorithm}} P_e.$$

For $\epsilon \in (0, 1]$, what is $P_e^*(1 - \epsilon)$? Also explain why.


Prob 9.2 (Ranking from pairwise comparisons) Suppose there exists a
ground-truth ranking, say $R$, of $n$ items. We wish to identify the ranking of the
$n$ items from pairwise comparisons, e.g., item $i$ is preferred over item $j$. Suppose
pairwise comparisons are given uniformly at random: items $(i, j)$ are compared with
probability $p \in [0, 1]$ independently over all $(i, j)$. Let $\hat{R}$ be a decoded ranking.
Let $P_e$ be the probability of error: $P_e := P(\hat{R} \neq R)$. What is the number of
pairwise comparisons on average required to make the probability of error arbitrarily
close to 0 as $n \to \infty$?


Prob 9.3 (Top-K ranking) Consider the BTL model (Bradley and Terry, 1952)
in Section 3.9. Let $x := [x_1, x_2, \dots, x_n]$ be the ground-truth score vector of $n$ items
where $x_i \in \mathbb{R}^+$. Assume that each pair of any two items is observed uniformly at
random w.p. $p$ and the observed pair has $L$ independent copies:

$$y_{ij}^{(\ell)} = \begin{cases} \text{Bern}\left(\frac{x_i}{x_i + x_j}\right), & \text{w.p. } p; \\ e, & \text{w.p. } 1 - p \end{cases}$$

where $y_{ij}^{(\ell)}$'s are independent over all pairs $(i, j)$ and $\ell \in \{1, 2, \dots, L\}$.
Given $y_{ij}^{(\ell)}$'s, we wish to decode the set $f(x)$ of top-K ranked items (top-K
partitioning). Let $\hat{f}(x)$ be a decoded top-K set. Let $P_e$ be the probability of error:
$P_e := P(\hat{f}(x) \neq f(x))$. In Section 3.9, we claimed that the minimum sample
complexity required for reliable top-K ranking is:

$$\Theta\left(\frac{n \ln n}{\Delta_K^2}\right)$$

where $\Delta_K := \frac{x_K - x_{K+1}}{x_K}$.

(a) Explain what the notation $\Theta(\cdot)$ means.
(b) In Section 3.9, we claimed that the minimum sample complexity is a
promising result, when comparing it to the total ordering case. Explain why.
(c) Show that if $p < \frac{\ln n}{n}$, $P_e$ cannot be made arbitrarily close to 0 no matter
what we do.
(d) Given that the $(i, j)$ pair is observed, compute:

(e) Describe the PageRank variant (that we learned in Section 3.10).
(f) In Section 3.10, we introduced an additional stage, called local refinement.
Explain why this stage is employed.


Prob 9.4 (Concavity) Consider a function: 


iyi l RE: (3.59) 


where $0 < x < 1$, $0 < b < 1$ and $0 < c < 1$. Prove that $f(x)$ is concave in $x$.


Prob 9.5 (A bounding technique) Consider a system with an input
$X \in \{1, 2, \dots, M\}$ and an output $Y \in \mathcal{Y}$. Here $\mathcal{Y}$ denotes a discrete
alphabet. Let $P_i$ be the probability distribution w.r.t. $Y$ conditioned on $X = i$.
Suppose $X$ is uniformly distributed.

(a) Show that

$$I(X; Y) \le \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \text{KL}(P_i \| P_j)$$

where $\text{KL}(\cdot \| \cdot)$ indicates the KL divergence defined w.r.t. log base 2.
(b) Find a condition under which the equality holds in the above.
Prob 9.6 (True or False?)
(a) Let $A \in \mathbb{R}^{m \times m}$ be a matrix with $m$ positive eigenvalues $\lambda_i$'s and
eigenvectors $v_i$'s:

$$A := \lambda_1 v_1 v_1^T + \lambda_2 v_2 v_2^T + \cdots + \lambda_m v_m v_m^T$$

where $\lambda_1 > \lambda_2 > \lambda_3 > \cdots > \lambda_m$ and $v_i$'s are orthonormal:
$v_i^T v_j = \mathbf{1}\{i = j\}$. Let $v \in \mathbb{R}^m$ be some non-zero vector. Then,

$$\frac{A^k v}{\|A^k v\|_2} \to v_1 \quad \text{as } k \to \infty.$$
3.12 Supervised Learning: Connection with Information Theory


Three key notions Part I focused on three essential notions in information the- 
ory: entropy, mutual information, and Kullback-Leibler divergence. In Parts I and 
II, we examined these notions in the context of Shannon’s source and channel cod- 
ing theorems. Entropy provides a concise way to determine the highest possible 
compression rate for an information source, while mutual information is a suitable 
metric for defining channel capacity. 


Role of the notions in data science Throughout the remainder of Part 
III, we will delve into three different applications of machine learning and deep 
learning where the key notions we have discussed are crucial. Specifically, we 
will examine: (i) how entropy plays a significant role in one of the most preva- 
lent methods of machine learning, known as supervised learning; (ii) the impor- 
tance of KL divergence in another popular technique, unsupervised learning; 
and (iii) how mutual information guides the design of machine learning algo- 
rithms that promote fairness for disadvantaged groups in comparison to advantaged 
ones. 


Outline In the upcoming sections, our focus will be on the role of entropy in 
supervised learning. We will begin by exploring what supervised learning is and 
then formulate an optimization problem that corresponds to it. We will then 
demonstrate that entropy plays a central role in designing an objective func- 
tion that leads to an optimal architecture for the optimization problem. Finally, 
we will learn how to solve this optimization problem using a powerful algo- 
rithm known as gradient descent, which is a slight variation of gradient ascent 
that we covered in Section 3.11. To aid in the implementation, we will use 
TensorFlow, which is one of the most intuitive and widely-used deep learning 
frameworks. If you are unfamiliar with TensorFlow, please refer to a tutorial in 
Appendix B. 


Machine learning Machine learning is about an algorithm that a computer sys- 
tem can execute by following a set of instructions. More formally, it refers to the 
study of algorithms used to train a computer system, enabling it to perform a spe- 
cific task. A pictorial illustration of this concept can be found in Fig. 3.33. The 
goal is to develop a computer system (referred to as a machine) that can take in an 
input, denoted as x, and produce an output, denoted as y. This system or function 
is designed to carry out a task of interest. For instance, if the task is filtering legitimate emails from spam, x could be a multi-dimensional quantity⁴ consisting of: (i) the frequency of a keyword like dollar signs $$$; and (ii) the frequency of another keyword, say winner. And y could indicate the email type, e.g., y = +1 indicates a legitimate email while y = -1 denotes spam. In machine learning, such y is called a label. Or if the task of interest is cat-vs-dog classification, x could be image-pixel values and y a binary value indicating whether the fed image is a cat (say y = 1) or a dog (y = 0).


Figure 3.33. Machine learning is the study of algorithms which provide a set of instructions to a computer system so that it can perform a specific task of interest. Let the input x indicate information employed to perform a task. Let the output y denote a task result. [Diagram: the input x enters the computer system (machine), which produces the output y; a training algorithm (together with data) feeds into the machine.]

The essence of machine learning lies in designing algorithms that can train a 
computer system to perform a desired task effectively. This involves using data as a 
crucial component in the algorithm design process. 


A remark on the naming From a machine's perspective, a machine learns the task from data. Hence, it is called machine learning, ML for short. The name was coined in 1959 by Arthur Lee Samuel (Samuel, 1967). See Fig. 3.34.

Arthur Samuel is one of the pioneers in Artificial Intelligence (AI), which encom- 
passes machine learning as a sub-field. AI involves the study of creating intelligence 
in machines, which differs from the natural intelligence observed in intelligent 
beings such as humans and animals. 

One of Samuel’s notable accomplishments in the early days of AI was the devel- 
opment of a computer player for the board game checkers (see the right figure in 
Fig. 3.34). He introduced several algorithms and concepts during the process of 
creating this computer program. These algorithms ultimately served as the foundation for AlphaGo (Silver et al., 2016), a computer program developed for the board game Go. AlphaGo went on to defeat Lee Sedol, a professional 9-dan player, in 2016, winning 4 out of 5 games (News, 2016).


4. In machine learning, such quantities are called features. These refer to key components of data that well describe their characteristics.


Figure 3.34. Arthur Lee Samuel was an American pioneer in artificial intelligence. One of his early achievements was developing a computer checkers program, which later formed the basis of AlphaGo. [Photo label: Arthur Samuel '59]


Mission of machine learning The ultimate objective of machine learning is to 
attain artificial intelligence. Thus, it can be regarded as one of the methodologies 
for achieving AI. As depicted in the block diagram in Fig. 3.33, the aim of ML is to 
develop an algorithm that enables the trained machine to exhibit behavior similar 
to intelligent beings. 


Supervised learning There are some methodologies which help us to achieve 
the goal of ML. One specific yet popular method is: 


Supervised Learning. 


Supervised learning involves the process of learning a function f (x) (which repre- 
sents the machine’s functionality) with the assistance of a supervisor, as illustrated 
in Fig. 3.35. The supervisor plays a crucial role in this process by providing input- 
output samples, which serve as the data used to train the machine. Typically, these 
input-output samples are represented as: 


$$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m} \tag{3.60}$$


where $(x^{(i)}, y^{(i)})$ indicates the $i$th input-output sample (also called a training sample or an example) and $m$ denotes the number of samples. Using the notation (3.60), supervised learning is to:


Estimate $f(\cdot)$ using the training samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. (3.61)


Optimization An effective approach to estimate f(x) is through optimization. To fully grasp this concept, let's delve into the formal definition of optimization. Optimization refers to the selection of an optimization variable that minimizes or maximizes a particular quantity of interest while taking any relevant constraints into account. Two significant factors come into play in this definition. Firstly, the optimization variable is a multi-dimensional quantity that can influence the quantity of interest and is subject to our design. Secondly, the quantity of interest that we aim to minimize or maximize is known as the objective function, and it is a one-dimensional scalar.


Figure 3.35. Supervised Learning: A methodology for designing a computer system $f(\cdot)$ with the help of a supervisor which offers input-output pair samples, called a training dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. [Diagram: the input x enters $f(\cdot)$, which outputs y; the training dataset feeds into $f(\cdot)$.]


Objective function To define the objective function, we need to know what supervised learning wishes to achieve. In view of the goal (3.61), what we want is:


$$y^{(i)} \approx f(x^{(i)}), \quad \forall i \in \{1, \ldots, m\}.$$


A natural question arises. How do we quantify closeness (reflected in the "$\approx$" notation) between the two quantities $y^{(i)}$ and $f(x^{(i)})$? One common way in the field is to employ a function, called a loss function, usually denoted by:


$$\ell(y^{(i)}, f(x^{(i)})). \tag{3.62}$$


One property that the loss function $\ell(\cdot, \cdot)$ should satisfy is that it should be small when the two arguments are close, while being zero when the two are identical. Using the loss function (3.62), one can formulate an optimization problem as:


$$\min_{f(\cdot)} \sum_{i=1}^{m} \ell(y^{(i)}, f(x^{(i)})). \tag{3.63}$$


How to introduce optimization variable? Unfortunately, the problem (3.63) contains no variable to optimize over in the usual sense. Instead, the quantity that we can optimize over is the function $f(\cdot)$ itself. How do we deal with such function optimization? There is one typical approach in the field: specify a function class (e.g., linear or quadratic), represent the function with parameters (also called weights), denoted by $w$, and then consider the weights as the optimization variable. Taking this approach, one can translate the problem (3.63) into:


$$\min_{w} \sum_{i=1}^{m} \ell(y^{(i)}, f_w(x^{(i)})) \tag{3.64}$$


where $f_w(x^{(i)})$ denotes the function $f(x^{(i)})$ parameterized by $w$.

The above optimization problem depends on how we define the two functions: (i) $f_w(x^{(i)})$ w.r.t. $w$; and (ii) the loss function $\ell(\cdot, \cdot)$. In machine learning, much work has been devoted to the choice of these functions.
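
The passage from function optimization to weight optimization can be made concrete with a small sketch. The linear function class and the squared-error loss below are assumptions chosen purely for illustration (the text has not fixed either choice at this point), and the toy data are hypothetical.

```python
# Sketch: turning "optimize over functions f" into "optimize over weights w".
# Assumptions (not from the text): a linear function class and a squared-error loss.

def f_w(w, x):
    """Linear function class: f_w(x) = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def squared_loss(y, y_hat):
    """Illustrative loss: small when y_hat is close to y, zero when identical."""
    return (y - y_hat) ** 2

def objective(w, xs, ys, loss=squared_loss):
    """Objective of (3.64): sum over samples of loss(y_i, f_w(x_i))."""
    return sum(loss(y, f_w(w, x)) for x, y in zip(xs, ys))

# Hypothetical training set generated by y = 2*x1 - x2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]

print(objective([2.0, -1.0], xs, ys))  # weights that fit the data exactly
print(objective([0.0, 0.0], xs, ys))   # a poor choice of weights
```

Once the function class is fixed, the objective is an ordinary function of the finite-dimensional vector w, which is what makes standard optimization algorithms applicable.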


Introduction of neural networks Around the same time that the ML field was founded, one architecture was suggested for the first function $f_w(\cdot)$ in the context of simple binary classifiers, in which $y$ takes one of two values. The architecture is called:


Perceptron,


and was invented in 1957 by one of the pioneers in AI, Frank Rosenblatt (Rosenblatt, 1958). Frank Rosenblatt, a psychologist, was intrigued by the workings of the brains of intelligent beings. His research into this area led him to develop the Perceptron, which provided valuable insights into neural networks.


How brains work The architecture of Perceptron was inspired by the structure of the brain, which contains numerous electrically excitable cells, known as neurons; see Fig. 3.36. Each neuron is represented by a red circle in the figure, and there are three neurons shown. There are three key features of neurons that influenced the design of Perceptron. The first is that neurons possess electrical properties, and therefore have a voltage. The second feature is that neurons are interconnected with one another through channels called synapses, which facilitate the transmission of electrical voltage signals between neurons. Depending on the connectivity strength of a synapse, a voltage signal from one neuron to another can increase or decrease. Finally, a neuron produces an action, known as activation, by generating an all-or-nothing pulse depending on its voltage level. If the voltage level is above a specific threshold, it generates an impulse signal with a certain magnitude, say 1; otherwise, it produces nothing.


Figure 3.36. Neurons are electrically excitable cells and are connected through synapses. [Diagram labels: neuron (voltage); activation.]


Figure 3.37. The architecture of Perceptron. [Diagram: the output is $f_w(x) = 1$ if $w^T x > \text{th}$, and 0 otherwise.]


Perceptron Frank Rosenblatt proposed the Perceptron architecture based on the three properties mentioned earlier, as depicted in Fig. 3.37. Let $x$ denote an $n$-dimensional real-valued signal: $x := [x_1, x_2, \ldots, x_n]^T$. Each component $x_i$ is distributed to a neuron, and $x_i$ is interpreted as the voltage level of the $i$th neuron. The voltage signal $x_i$ is transmitted through a synapse to another neuron located on the right in the figure (indicated by a large circle). The voltage level can either increase or decrease, depending on the strength of the synapse's connectivity. To account for this, a weight $w_i$ is multiplied to $x_i$, so that $w_i x_i$ is the delivered voltage signal at the terminal neuron. Rosenblatt also observed empirically that the voltage level at the terminal neuron increases as more neurons are connected. He therefore introduced an adder that aggregates the voltage signals from all the neurons, modeling the voltage signal at the terminal neuron as:


$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T x. \tag{3.65}$$

Lastly, in an effort to mimic the activation, he modeled the output signal as


$$f_w(x) = \begin{cases} 1 & \text{if } w^T x > \text{th}, \\ 0 & \text{otherwise} \end{cases} \tag{3.66}$$


where "th" indicates a certain threshold level. It can also be simply denoted as

$$f_w(x) = \mathbf{1}\{w^T x > \text{th}\}. \tag{3.67}$$
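
The Perceptron rule, a weighted sum followed by a threshold, can be sketched in a few lines. The weights, threshold and inputs below are hypothetical values chosen only to exercise both output cases.

```python
def perceptron(w, x, th):
    """Perceptron output: 1 if w^T x > th, else 0."""
    voltage = sum(wi * xi for wi, xi in zip(w, x))  # the adder: w^T x
    return 1 if voltage > th else 0

w = [0.5, -1.0, 2.0]   # hypothetical synapse weights
th = 1.0               # hypothetical threshold

print(perceptron(w, [1.0, 0.0, 1.0], th))  # w^T x = 2.5, above the threshold
print(perceptron(w, [1.0, 1.0, 0.0], th))  # w^T x = -0.5, below the threshold
```

Note that the output changes only when the weighted sum crosses the threshold; small changes in w usually leave it untouched, which is the root of the differentiability issue discussed next.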


Activation Taking the Perceptron architecture in Fig. 3.37, one can formulate the optimization problem (3.64) as:


$$\min_{w} \sum_{i=1}^{m} \ell\big(y^{(i)}, \mathbf{1}\{w^T x^{(i)} > \text{th}\}\big). \tag{3.68}$$


This was the optimization problem initially proposed, but an issue arises in solving it. The objective function contains an indicator function, which makes it non-differentiable. This is a difficulty because the prominent algorithm that we learned in Section 3.11, gradient ascent (or descent), involves derivative operations, which cannot be applied when the function is non-differentiable.

To address this problem, one common approach in the field is to approximate the activation function. There are several ways to achieve this approximation. Below, we explore one popular method.


Logistic regression The popular approximation approach is to replace the abrupt indicator function with a smooth transition from 0 to 1:


$$f_w(x) = \frac{1}{1 + e^{-w^T x}}. \tag{3.69}$$


Notice that $f_w(x) \approx 0$ when $w^T x$ is very small (i.e., very negative); it then grows exponentially with an increase in $w^T x$; later grows logarithmically; and finally saturates at 1 when $w^T x$ is very large. See Fig. 3.38. The function (3.69) is a very popular one used in statistics, called the logistic⁵ function (Garnier and Quetelet, 1838). There is another name: the sigmoid⁶ function.


Figure 3.38. Logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$.

There are two good things about the logistic function. First it is differentiable. 
Second, it can serve as the probability for the output in the binary classifier, e.g., 
P(y = 1) where y denotes the ground-truth label in the binary classifier. So it is 
interpretable. 
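
As a rough illustration of the first property, the sketch below compares the indicator activation with the logistic function and evaluates the derivative of the latter, using the standard identity that the derivative of the logistic function σ(z) equals σ(z)(1 − σ(z)). The threshold of 0 for the indicator is an assumption made only for this comparison.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function evaluated at z = w^T x."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: logistic(z) * (1 - logistic(z))."""
    s = logistic(z)
    return s * (1.0 - s)

def indicator(z, th=0.0):
    """Perceptron activation: flat everywhere except for a jump at th (th = 0 assumed)."""
    return 1.0 if z > th else 0.0

for z in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(z, indicator(z), round(logistic(z), 4), round(logistic_grad(z), 4))
```

The indicator has zero slope wherever it is differentiable, so it gives a gradient-based method nothing to follow, whereas the logistic function has a strictly positive slope everywhere.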


Look ahead Assuming the logistic activation function, what would be an appro- 
priate loss function? In a certain sense, the design of an optimal loss function is 
closely related to the concept of entropy. In the following section, we will explore 
how this entropy-related concept is used to design the optimal loss function. 


5. The word logistic comes from a Greek word which means a slow growth, like a logarithmic growth.


6. Sigmoid means resembling the lower-case Greek letter sigma, S-shaped.


3.13 Supervised Learning: Logistic Regression and Cross Entropy


Recap In the previous section, we formulated an optimization problem for super- 
vised learning based on the Perceptron architecture: 


$$\min_{w} \sum_{i=1}^{m} \ell(y^{(i)}, f_w(x^{(i)})). \tag{3.70}$$

As an activation function, we considered a logistic function:

$$f_w(x) = \frac{1}{1 + e^{-w^T x}}. \tag{3.71}$$


We then claimed that an entropy-related concept plays a role in the design of the 
optimal loss function. 


Outline This section will demonstrate the validity of the claim in three steps. 
Firstly, we will explore the definition of the optimal loss function. Secondly, we 
will examine the role of entropy in the design of the optimal loss function. Finally, 
we will discuss how to solve the optimization problem. 


Optimality in a sense of maximizing likelihood Logistic regression is a binary classifier that uses the logistic function (3.71). Fig. 3.39 provides a visual representation of logistic regression. It should be noted that the output $\hat{y}$ of logistic regression falls between 0 and 1:


$$0 \le \hat{y} \le 1.$$


Hence, one can interpret this as a probability quantity. The optimality of a classifier can be defined under the following assumption inspired by the probabilistic interpretation:


$$\text{Assumption}: \hat{y} = P(y = 1 | x). \tag{3.72}$$


Figure 3.39. Logistic regression.


To understand what it means, consider the likelihood of the ground-truth labels given the inputs:


$$P\big(\{y^{(i)}\}_{i=1}^{m} \,\big|\, \{x^{(i)}\}_{i=1}^{m}\big). \tag{3.73}$$


Notice that the classifier output $\hat{y}$ is a function of weights $w$. Hence, assuming (3.72), the likelihood (3.73) is also a function of $w$.

We are now ready to define the optimal $w$. The optimal weight, say $w^*$, is defined as the one that maximizes the likelihood (3.73):


$$w^* := \arg\max_{w} P\big(\{y^{(i)}\}_{i=1}^{m} \,\big|\, \{x^{(i)}\}_{i=1}^{m}\big). \tag{3.74}$$


There are other ways to define the optimality. Here, we employ the maximum likelihood principle, the most popular choice. This is exactly where the definition of the optimal loss function, say $\ell^*(\cdot, \cdot)$, kicks in. We say that $\ell^*(\cdot, \cdot)$ is the one that satisfies:


$$\arg\min_{w} \sum_{i=1}^{m} \ell^*(y^{(i)}, \hat{y}^{(i)}) = \arg\max_{w} P\big(\{y^{(i)}\}_{i=1}^{m} \,\big|\, \{x^{(i)}\}_{i=1}^{m}\big). \tag{3.75}$$

As mentioned earlier, an entropy-related concept appears in $\ell^*(\cdot, \cdot)$. Below, we will figure this out.


Finding the optimal loss function $\ell^*(\cdot, \cdot)$ Usually samples are obtained from different data $x^{(i)}$'s. Hence, it is reasonable to assume that such samples are independent of each other:


$$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m} \text{ are independent over } i. \tag{3.76}$$

Under this assumption, we can rewrite the likelihood (3.73) as:


$$\begin{aligned}
P\big(\{y^{(i)}\}_{i=1}^{m} \,\big|\, \{x^{(i)}\}_{i=1}^{m}\big)
&\overset{(a)}{=} \frac{P\big(\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}\big)}{P\big(\{x^{(i)}\}_{i=1}^{m}\big)} \\
&\overset{(b)}{=} \frac{\prod_{i=1}^{m} P(x^{(i)}, y^{(i)})}{\prod_{i=1}^{m} P(x^{(i)})} \\
&\overset{(c)}{=} \prod_{i=1}^{m} P(y^{(i)} | x^{(i)})
\end{aligned} \tag{3.77}$$


where (a) and (c) are due to the definition of conditional probability; and (b) comes from the independence assumption (3.76). Here $P(x^{(i)}, y^{(i)})$ denotes the probability distribution of the input-output pair:


$$P(x^{(i)}, y^{(i)}) := P(X = x^{(i)}, Y = y^{(i)}) \tag{3.78}$$

where $X$ and $Y$ indicate random variables of the input and the output, respectively.
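Recall the assumption (3.72) made with regard to $\hat{y}$:


$$\hat{y} = P(y = 1 | x).$$


This implies that:


$$y = 1: \; P(y | x) = \hat{y};$$

$$y = 0: \; P(y | x) = 1 - \hat{y}.$$

Hence, one can represent $P(y | x)$ as:

$$P(y | x) = \hat{y}^{y} (1 - \hat{y})^{1 - y}.$$

Using the notations of $(x^{(i)}, y^{(i)})$ and $\hat{y}^{(i)}$, we get:


$$P(y^{(i)} | x^{(i)}) = (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}.$$


Plugging this into (3.77), we get:


$$P\big(\{y^{(i)}\}_{i=1}^{m} \,\big|\, \{x^{(i)}\}_{i=1}^{m}\big) = \prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}. \tag{3.79}$$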
This together with (3.74) yields:

$$\begin{aligned}
w^* &:= \arg\max_{w} \prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}} \\
&\overset{(a)}{=} \arg\max_{w} \sum_{i=1}^{m} y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \\
&\overset{(b)}{=} \arg\min_{w} \sum_{i=1}^{m} -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})
\end{aligned} \tag{3.80}$$


where (a) comes from the fact that $\log(\cdot)$ is a strictly increasing function and $\prod_{i=1}^{m} (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}$ is positive; and (b) is due to changing the sign of the objective while replacing max with min.


The term inside the summation in the last equality in (3.80) respects the formula of another key notion in information theory: cross entropy. In the context of a loss function, it is named cross entropy loss:


$$\ell_{\text{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.81}$$

Hence, the optimal loss function that yields the maximum likelihood solution is cross entropy loss:


$$\ell^*(\cdot, \cdot) = \ell_{\text{CE}}(\cdot, \cdot).$$
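
The equivalence between maximizing the likelihood and minimizing the summed cross entropy loss can be checked numerically. The sketch below uses hypothetical labels and predictions; it verifies that the total cross entropy loss equals the negative base-2 logarithm of the likelihood, so a larger likelihood always corresponds to a smaller loss.

```python
import math

def likelihood(ys, y_hats):
    """Product over samples of y_hat^y * (1 - y_hat)^(1 - y)."""
    prod = 1.0
    for y, y_hat in zip(ys, y_hats):
        prod *= (y_hat ** y) * ((1.0 - y_hat) ** (1 - y))
    return prod

def cross_entropy_loss(y, y_hat):
    """Binary cross entropy loss with log base 2."""
    return -y * math.log2(y_hat) - (1 - y) * math.log2(1.0 - y_hat)

def total_ce(ys, y_hats):
    """Sum of the cross entropy loss over all samples."""
    return sum(cross_entropy_loss(y, p) for y, p in zip(ys, y_hats))

ys = [1, 0, 1, 1]                 # hypothetical labels
good = [0.9, 0.2, 0.8, 0.7]       # predictions that agree with the labels
bad = [0.4, 0.6, 0.5, 0.3]        # predictions that mostly disagree

for name, y_hats in (("good", good), ("bad", bad)):
    print(name, likelihood(ys, y_hats), total_ce(ys, y_hats),
          -math.log2(likelihood(ys, y_hats)))
```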


Remarks on cross entropy loss (3.81) Let us say a few words about why the loss function (3.81) is called cross entropy loss. This naming comes from the definition of cross entropy. The cross entropy is defined w.r.t. two random variables. For simplicity, consider two binary random variables, say $X \sim \text{Bern}(p)$ and $Y \sim \text{Bern}(q)$. For such two random variables, cross entropy is defined as:


$$H(p, q) := -p \log q - (1 - p) \log(1 - q). \tag{3.82}$$


Notice that the formula (3.82) is exactly the same as (3.81), the term inside the summation in (3.80), except for having different notations. Hence, it is called cross entropy loss. You may wonder why $H(p, q)$ in (3.82) is called cross entropy. The rationale comes from the following fact (check this in Prob 10.4):


$$H(p, q) \ge H(p) := -p \log p - (1 - p) \log(1 - p) \tag{3.83}$$


where $H(p)$ denotes the entropy of $\text{Bern}(p)$. In Prob 10.4, you will be asked to verify that the equality holds when $p = q$. One can interpret $H(p, q)$ as an entropic measure of discrepancy across distributions. Hence, it is called cross entropy.
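
A quick numerical check of the inequality between cross entropy and entropy, for a hypothetical value p = 0.3 and a few values of q, is sketched below. The gap H(p, q) − H(p) is the KL divergence between Bern(p) and Bern(q), which is why it vanishes exactly when q = p.

```python
import math

def cross_entropy(p, q):
    """Binary cross entropy H(p, q) with log base 2."""
    return -p * math.log2(q) - (1 - p) * math.log2(1 - q)

def entropy(p):
    """Binary entropy H(p), which equals H(p, p)."""
    return cross_entropy(p, p)

p = 0.3  # hypothetical parameter
for q in (0.1, 0.3, 0.5, 0.9):
    gap = cross_entropy(p, q) - entropy(p)   # equals KL(Bern(p) || Bern(q))
    print(q, round(cross_entropy(p, q), 4), round(gap, 4))
```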


How to solve logistic regression? In view of (3.80), the optimization problem for logistic regression can be written as:


$$\min_{w} \sum_{i=1}^{m} -y^{(i)} \log \frac{1}{1 + e^{-w^T x^{(i)}}} - (1 - y^{(i)}) \log \frac{e^{-w^T x^{(i)}}}{1 + e^{-w^T x^{(i)}}}. \tag{3.84}$$

Let $J(w)$ be the normalized version of the objective function:

$$J(w) := \frac{1}{m} \sum_{i=1}^{m} -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}). \tag{3.85}$$


It turns out that the above optimization is a convex optimization problem. In other words, $J(w)$ is convex in the optimization variable $w$. For the rest of this section, we will prove the convexity, and then discuss how to solve the problem.


Proof of convexity First, one can readily show that convexity is preserved under addition. Why? Think about the definition of convex functions. So it suffices to prove the following two:


(i) $-\log \frac{1}{1 + e^{-w^T x}}$ is convex in $w$;

(ii) $-\log \frac{e^{-w^T x}}{1 + e^{-w^T x}}$ is convex in $w$.


Since the second function in the above can be represented as the sum of a linear function and the first function:


$$-\log \frac{e^{-w^T x}}{1 + e^{-w^T x}} = -\log e^{-w^T x} - \log \frac{1}{1 + e^{-w^T x}},$$


where $-\log e^{-w^T x}$ is linear in $w$, it suffices to prove the convexity of the first function.

The first function can be rewritten as:


$$-\log \frac{1}{1 + e^{-w^T x}} = \log(1 + e^{-w^T x}). \tag{3.86}$$


In fact, proving the convexity of the function (3.86) is a bit involved if one relies on the definition of convex functions. There is another way to prove it, based on the second derivative of a function, called the Hessian. How to compute the Hessian? What is the dimension of the Hessian? For a function $f: \mathbb{R}^d \to \mathbb{R}$, the gradient $\nabla f(x) \in \mathbb{R}^d$ and the Hessian $\nabla^2 f(x) \in \mathbb{R}^{d \times d}$. If you are not familiar with these, consult a vector calculus course or Wikipedia.

A well-known fact says that if the Hessian of a function is positive semi-definite (PSD)⁷, then the function is convex. Check this in Prob 10.5. If the proof is too much, you may simply remember this fact; the statement itself is useful. Here we will use this fact to prove the convexity of the function (3.86).

Taking a derivative of the RHS formula in (3.86) w.r.t. $w$, we get:


$$\nabla_w \log(1 + e^{-w^T x}) = \frac{1}{\ln 2} \cdot \frac{-x e^{-w^T x}}{1 + e^{-w^T x}}.$$

This is due to a chain rule of derivatives and the facts that $\frac{d}{dz} \log z = \frac{1}{\ln 2} \frac{1}{z}$, $\frac{d}{dz} e^{z} = e^{z}$ and $\frac{d}{dw} w^T x = x$. Taking another derivative of the above, we obtain a Hessian as follows:

$$\begin{aligned}
\nabla_w^2 \log(1 + e^{-w^T x}) &= \nabla_w \left( \frac{1}{\ln 2} \cdot \frac{-x e^{-w^T x}}{1 + e^{-w^T x}} \right) \\
&\overset{(a)}{=} \frac{1}{\ln 2} \cdot \frac{x x^T e^{-w^T x} (1 + e^{-w^T x}) - x x^T e^{-w^T x} e^{-w^T x}}{(1 + e^{-w^T x})^2} \\
&= \frac{1}{\ln 2} \cdot \frac{x x^T e^{-w^T x}}{(1 + e^{-w^T x})^2} \\
&\succeq 0
\end{aligned} \tag{3.87}$$

where (a) is due to the derivative rule of a quotient of two functions: $\frac{d}{dz} \frac{f(z)}{g(z)} = \frac{f'(z) g(z) - f(z) g'(z)}{g^2(z)}$. You may wonder why $\frac{d}{dw}(-x e^{-w^T x}) = x x^T e^{-w^T x}$. Why not $xx$, $x^T x^T$ or $x^T x$ in front of $e^{-w^T x}$? One rule of thumb is to simply try all the candidates and choose the one which does not have a syntax error (matrix dimension mismatch). For instance, $xx$ (or $x^T x^T$) is just an invalid operation. $x^T x$ is not the right one because the Hessian must be a $d$-by-$d$ matrix. The only candidate left without any syntax error is $x x^T$. We see that $x x^T$ has a single non-zero eigenvalue, $\|x\|^2$, with all the other eigenvalues being 0. Why? Since every eigenvalue is non-negative, the Hessian is PSD, and therefore we prove the convexity.


7. We say that a symmetric matrix, say $Q = Q^T \in \mathbb{R}^{d \times d}$, is positive semi-definite if $v^T Q v \ge 0$, $\forall v \in \mathbb{R}^d$, i.e., all the eigenvalues of $Q$ are non-negative. It is simply denoted by $Q \succeq 0$.
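
The PSD claim can be sanity-checked numerically. The sketch below, with a hypothetical data point x and weight vector w, builds the closed-form Hessian of log2(1 + exp(−wᵀx)), evaluates the quadratic form vᵀHv for random directions v, and compares it with a second-order finite difference of the function along v. The step size h is an arbitrary choice.

```python
import math
import random

def g(w, x):
    """The function whose convexity is at stake: log2(1 + exp(-w^T x))."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return math.log2(1.0 + math.exp(-a))

def hessian(w, x):
    """Closed form: (1/ln 2) * x x^T * e^{-a} / (1 + e^{-a})^2 with a = w^T x."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    c = math.exp(-a) / (math.log(2) * (1.0 + math.exp(-a)) ** 2)
    return [[c * xi * xj for xj in x] for xi in x]

def quad_form(H, v):
    """Quadratic form v^T H v."""
    return sum(v[i] * H[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))

def second_difference(w, x, v, h=1e-3):
    """Finite-difference estimate of the curvature of g along direction v."""
    w_plus = [wi + h * vi for wi, vi in zip(w, v)]
    w_minus = [wi - h * vi for wi, vi in zip(w, v)]
    return (g(w_plus, x) - 2 * g(w, x) + g(w_minus, x)) / h ** 2

random.seed(0)
x = [1.0, -2.0, 0.5]          # hypothetical data point
w = [0.3, 0.1, -0.4]          # hypothetical weights
H = hessian(w, x)

for _ in range(5):
    v = [random.uniform(-1, 1) for _ in x]
    print(round(quad_form(H, v), 6), round(second_difference(w, x, v), 6))
```

Every printed quadratic form is non-negative because it equals a positive constant times (xᵀv)², which is another way to see that the Hessian is PSD.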


Gradient descent How to solve the convex optimization problem (3.80)? Since there is no constraint in the optimization, $w^*$ must be the stationary point, i.e., the one such that


$$\nabla J(w^*) = 0. \tag{3.88}$$


However, we face a challenge in determining the optimal point $w^*$ because there is no closed-form solution and it cannot be analytically derived. Various algorithms have been developed to overcome this challenge and find the optimal point without the need for a closed-form solution. One such algorithm is gradient ascent, which we learned about in Section 3.11. In this case, since the function of interest is convex, we use gradient descent instead. The process of gradient descent is similar to that of gradient ascent, with the only difference being that the estimate is updated in the opposite direction of the gradient:


$$w^{(t+1)} \longleftarrow w^{(t)} - \alpha \nabla J(w^{(t)}) \tag{3.89}$$


where $w^{(t)}$ denotes the $t$th estimate of $w^*$ and $\alpha > 0$ indicates the learning rate. We take the opposite direction because we are now minimizing a convex function rather than maximizing a concave one: the gradient points in the direction of steepest increase, so decreasing $J$ requires a step along $-\nabla J(w^{(t)})$. See Fig. 3.40.


Figure 3.40. How gradient descent works. [Diagram labels: slope $\nabla J(w^{(t)})$ at the $t$-th estimate $w^{(t)}$; update $w^{(t+1)} \leftarrow w^{(t)} - \alpha \nabla J(w^{(t)})$.]
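
A minimal sketch of gradient descent applied to logistic regression follows, written in plain Python rather than TensorFlow. Several things here are assumptions made for illustration: the toy dataset, the learning rate, the number of steps, the zero initialization, and the use of the natural logarithm (which only rescales J by a constant relative to log base 2). With the natural log, the gradient of J takes the well-known form (1/m) Σ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾) x⁽ⁱ⁾.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Prediction y_hat = 1 / (1 + exp(-w^T x))."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def J(w, xs, ys):
    """Normalized cross entropy objective, using the natural log."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, x)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def grad_J(w, xs, ys):
    """Gradient of J: (1/m) * sum_i (y_hat_i - y_i) * x_i."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            g[j] += err * xj / len(xs)
    return g

def gradient_descent(xs, ys, alpha=0.5, steps=200):
    """Iterate w <- w - alpha * grad J(w), starting from w = 0."""
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = grad_J(w, xs, ys)
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical 1-D data with a constant feature appended as a bias term.
# The labels are deliberately not linearly separable, so the minimizer is finite.
xs = [[-2.0, 1.0], [-1.0, 1.0], [-0.5, 1.0], [0.5, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [0, 0, 1, 0, 1, 1]

w = gradient_descent(xs, ys)
print(J([0.0, 0.0], xs, ys), J(w, xs, ys))
```

The objective at the returned weights is lower than at the starting point, and the gradient there is close to zero, in line with the stationarity condition.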


Look ahead We have established an optimization problem for supervised learn- 
ing and observed the crucial role of cross entropy in designing the optimal loss 
function. Moreover, we acquired knowledge about solving the problem with the 
widely-used gradient descent algorithm. In the following section, we will put this 
algorithm into practice using TensorFlow to create a simple classifier.


3.14 Supervised Learning: TensorFlow Implementation


Recap In the previous sections, we have formulated an optimization problem for supervised learning:


$$\min_{w} \sum_{i=1}^{m} \ell_{\text{CE}}(y^{(i)}, \hat{y}^{(i)}) \tag{3.90}$$


where $\hat{y} := \frac{1}{1 + e^{-w^T x}}$ indicates the prediction output with logistic activation and $\ell_{\text{CE}}$ denotes cross entropy loss:


$$\ell_{\text{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.91}$$


We proved that cross entropy loss $\ell_{\text{CE}}(\cdot, \cdot)$ is the optimal loss function in a sense of maximizing the likelihood. We also showed that the normalized version $J(w)$ of the above objective function is convex in $w$, and therefore, it can be solved via gradient descent:


$$J(w) := \frac{1}{m} \sum_{i=1}^{m} -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}). \tag{3.92}$$

Outline In this section, we will explore the implementation of the algorithm using 
a software tool for a simple classifier. This section is divided into three parts. Firstly, 
we will examine the setting of the simple classifier that we will focus on. In the 
second part, we will discuss four implementation details regarding the classifier. 
The first detail is the dataset used for training and testing. In machine learning, 
testing refers to evaluating the performance of a trained model. For this purpose, 
we will use an unseen dataset called the test dataset, which has never been used
during training. The second detail is how to build a deep neural network model 
with the ReLU activation function. The Perceptron, introduced in Section 3.12, 
is the first neural network. A deep neural network is an extended version of the 
Perceptron, with at least one hidden layer placed between the input and output 
layers (Ivakhnenko, 1971). The ReLU is a popular activation function often used 
in hidden layers (Glorot et al., 2011). It stands for Rectified Linear Unit, and its 
operation is given by ReLU(x) = max(0, x). The third implementation detail con- 
cerns the softmax activation used at the output layer, which is a natural extension 
of the logistic activation for multiple classes. The fourth implementation detail per- 
tains to the Adam optimizer, which is an advanced version of gradient descent and 
widely used in practice (Kingma and Ba, 2014).


Figure 3.41. Handwritten digit classification.


Figure 3.42. MNIST dataset: An input image is of 28-by-28 pixels, each indicating an intensity from 0 (white) to 1 (black); each label with size 1 takes one of the 10 classes from 0 to 9.


The final part of this section will focus on programming the classifier using TensorFlow, a popular deep learning framework. Specifically, we will be using Keras, a high-level API that is fully integrated with TensorFlow.


Handwritten digit classification We will be focusing on a simple classifier that aims to recognize handwritten digits from images, as shown in Fig. 3.41. For training our model, we will be using a widely popular dataset known as the MNIST (Modified National Institute of Standards and Technology) dataset (LeCun et al., 1998), which contains $m = 60{,}000$ training images and $m_{\text{test}} = 10{,}000$ testing images. This dataset was created by remixing the examples from NIST's original datasets and was prepared by Yann LeCun and his collaborators. Each image, denoted as $x^{(i)}$, consists of a 28 × 28 pixel matrix, with each pixel representing a grayscale level ranging from 0 (white) to 1 (black). Additionally, each image has a corresponding label, denoted as $y^{(i)}$, which falls under one of the ten classes, $y^{(i)} \in \{0, 1, \ldots, 9\}$, as shown in Fig. 3.42.


Figure 3.43. A two-layer fully-connected neural network where input size is 28 × 28 = 784, the number of hidden neurons is 500 and the number of classes is 10. We employ ReLU activation at the hidden layer, and softmax activation at the output layer; see Fig. 3.44 for details.


A deep neural network model We will use an advanced version of logistic 
regression as our model for the handwritten digit classifier. The first reason for using 
this advanced version is that logistic regression is a linear classifier and its perfor- 
mance may not be optimal in many applications since the prediction function is 
restricted to a linear function class. To overcome this limitation, researchers developed a Perceptron-like neural network with multiple layers, known as a deep neural network (DNN). Although the DNN was invented in the 1960s (Ivakhnenko, 1971), its performance benefits started to be greatly appreciated only in the past decade, due to a big event in 2012. Geoffrey Hinton and his PhD students won the ImageNet recognition competition by a large margin using a DNN (Krizhevsky et al., 2012). This event marked the start of the deep learning revolution. Since a linear classifier does not
perform well for digit classification, we will use a simple version of DNN with only 
two layers — a hidden layer and an output layer, as shown in Fig. 3.43. By conven- 
tion, the input layer is not counted as a layer, so it is referred to as a two-layer neural 
network instead of a three-layer one. 

Each neuron in the hidden layer follows the same procedure as that in the Perceptron: a linear operation followed by activation. For activation, the logistic function or its shifted and rescaled version, called the tanh function (spanning -1 and +1), was frequently employed in early days. However, a significant number of experts and practitioners have found that the Rectified Linear Unit (ReLU) is a more powerful function that enables faster training and delivers better or equivalent performance (Glorot et al., 2011). As stated previously, ReLU's function can be expressed as: ReLU(x) = max(0, x). In the deep learning community, a popular approach is to utilize ReLU activation in all hidden layers, and we will follow this standard as depicted in Fig. 3.43.


Figure 3.44. Softmax activation employed at the output layer. This is a natural extension of logistic activation intended for the two-class case.


Softmax activation at the output layer We have another reason to use an advanced version of logistic regression, which pertains to the number of classes in our classifier. Since logistic regression is designed for binary classification, it cannot be directly used for our digit classifier, which has 10 classes. To address this, we need to use a generalized version of the logistic function known as softmax. The operation of softmax is illustrated in Fig. 3.44.

Let $z$ be the output of the last layer in a neural network prior to activation:


$$z := [z_1, z_2, \ldots, z_c]^T \in \mathbb{R}^c \tag{3.93}$$


where $c$ denotes the number of classes. The softmax function is then defined as:


$$\hat{y}_j := [\text{softmax}(z)]_j = \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}}, \quad j \in \{1, 2, \ldots, c\}. \tag{3.94}$$


Note that this is a natural extension of the logistic function: for $c = 2$,


$$\begin{aligned}
\hat{y}_1 := [\text{softmax}(z)]_1 &= \frac{e^{z_1}}{e^{z_1} + e^{z_2}} \\
&= \frac{1}{1 + e^{-(z_1 - z_2)}} \\
&= \sigma(z_1 - z_2)
\end{aligned} \tag{3.95}$$


where $\sigma(\cdot)$ is the logistic function. Viewing $z_1 - z_2$ as the (pre-activation) output of the binary classifier, this coincides with the logistic function.
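
The reduction of softmax to the logistic function for two classes can be confirmed numerically. In the sketch below, the pre-activation values are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability device that does not change the result (it is not part of the definition above).

```python
import math

def softmax(z):
    """Softmax; subtracting max(z) avoids overflow without changing the output."""
    z_max = max(z)
    exps = [math.exp(zj - z_max) for zj in z]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

z = [1.2, -0.7]                      # hypothetical pre-activation outputs, c = 2
print(softmax(z)[0], logistic(z[0] - z[1]))

z10 = [0.1 * j for j in range(10)]   # hypothetical outputs for c = 10 classes
print(sum(softmax(z10)))
```

The first line prints two matching values, and the second confirms that the softmax outputs sum to one, which is what allows them to be read as class probabilities.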

Here $\hat{y}_i$ can be interpreted as the probability that the example belongs to class $i$. Hence, like the binary classifier, one may want to assume:


$$\hat{y}_i = P\big(y = [0, \ldots, \underbrace{1}_{i\text{th position}}, \ldots, 0]^T \,\big|\, x\big), \quad i \in \{1, \ldots, c\}. \tag{3.96}$$


Under this assumption, one can verify that the optimal loss function (in a sense of maximizing likelihood) is again cross entropy loss:


$$\ell^*(y, \hat{y}) = \ell_{\text{CE}}(y, \hat{y}) = \sum_{j=1}^{c} -y_j \log \hat{y}_j$$


where $y$ indicates a label of one-hot vector type. For instance, in the case of label $= 2$ with $c = 10$, $y$ takes:


$$y = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]^T,$$

where the entries correspond to the digits 0 through 9 in order, so the 1 sits at the entry for digit 2.

The proof is almost the same as that in the binary classifier. So we will omit the proof. Instead you will have a chance to prove it in Prob 10.3.

Due to the above rationales, softmax activation has been widely used for many 
classifiers. Hence, we will use the conventional activation in our digit classifier. 
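
To make the model concrete, here is a plain-Python sketch of the forward pass of a two-layer network of the size described (784 inputs, 500 hidden ReLU neurons, 10 softmax outputs), followed by the multi-class cross entropy loss against a one-hot label. This is not the TensorFlow implementation; the random initialization, the random stand-in for an image, and the use of the natural log are all assumptions made for illustration.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(z):
    z_max = max(z)
    exps = [math.exp(a - z_max) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(W, b, x):
    """Linear operation W x + b, with W stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(params, x):
    """Two-layer network: hidden ReLU layer, then softmax output layer."""
    W1, b1, W2, b2 = params
    h = relu(dense(W1, b1, x))
    return softmax(dense(W2, b2, h))

def one_hot(label, c):
    return [1.0 if j == label else 0.0 for j in range(c)]

def cross_entropy(y, y_hat):
    """Multi-class cross entropy: sum_j -y_j log y_hat_j (natural log here)."""
    return sum(-yj * math.log(pj) for yj, pj in zip(y, y_hat) if yj > 0)

def init_params(n_in, n_hidden, n_out, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    W1 = [[rng.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_out)]
    return W1, [0.0] * n_hidden, W2, [0.0] * n_out

rng = random.Random(0)
params = init_params(784, 500, 10, rng)
x = [rng.random() for _ in range(784)]      # stand-in for a flattened 28x28 image
y_hat = forward(params, x)
print(len(y_hat), sum(y_hat), cross_entropy(one_hot(2, 10), y_hat))
```

With untrained weights the ten output probabilities are nearly uniform, so the loss is close to log 10; training would push the probability of the correct class toward 1 and the loss toward 0.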


Adam optimizer (Kingma and Ba, 2014) Lastly, we need to specify the algorithm used for training. As mentioned earlier, we will use an advanced version of gradient descent, called the Adam optimizer. To see how the optimizer operates, let us first recall the vanilla gradient descent:


$$w^{(t+1)} \longleftarrow w^{(t)} - \alpha \nabla J(w^{(t)})$$


where $w^{(t)}$ indicates the estimated weight in the $t$-th iteration, and $\alpha$ denotes the learning rate. Notice that the weight update relies only on the current gradient, reflected in $\nabla J(w^{(t)})$. Hence, in case $\nabla J(w^{(t)})$ fluctuates too much over iterations, the weight update oscillates significantly, thereby bringing about unstable training. To address this, people often use a variant algorithm that exploits past gradients for the purpose of stabilization. That is the Adam optimizer.

Here is how Adam works. The weight update takes the following formula
instead:

$$ w^{(t+1)} = w^{(t)} + \alpha \frac{m^{(t)}}{\sqrt{s^{(t)}} + \epsilon} \tag{3.97} $$

where $m^{(t)}$ indicates a weighted average of the current and past gradients:

$$ m^{(t)} = \frac{1}{1 - \beta_1^t} \left( \beta_1 m^{(t-1)} - (1 - \beta_1) \nabla J(w^{(t)}) \right). \tag{3.98} $$

Here $\beta_1 \in [0, 1]$ is a hyperparameter that captures the weight of past gradients,
and hence it is called the momentum (Polyak, 1964). The notation $m$ stands for
momentum. The factor $\frac{1}{1 - \beta_1^t}$ is applied in front, in an effort to stabilize training in
initial iterations (small $t$). Check the detailed rationale behind this in Prob 10.7.

$s^{(t)}$ is a normalization factor that makes the effect of $\nabla J(w^{(t)})$ almost constant
over $t$. In case $\nabla J(w^{(t)})$ is too big or too small, we may have significantly different
scalings in magnitude. Similar to $m^{(t)}$, $s^{(t)}$ is defined as a weighted average of the
current and past values (Hinton et al., 2012):

$$ s^{(t)} = \frac{1}{1 - \beta_2^t} \left( \beta_2 s^{(t-1)} + (1 - \beta_2) (\nabla J(w^{(t)}))^2 \right) \tag{3.99} $$

where $\beta_2 \in [0, 1]$ denotes another hyperparameter that captures the weight of past
values, and $s$ stands for square.

Notice that the dimensions of $w^{(t)}$, $m^{(t)}$ and $s^{(t)}$ are the same. All the operations
that appear in the above (including division in (3.97) and square in (3.99)) are
component-wise. In (3.97), $\epsilon$ is a tiny value introduced to avoid division by 0 in
practice (usually $10^{-8}$).
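
To make the update rule concrete, here is a minimal scalar sketch of Adam in plain Python. It keeps the sign convention of (3.97), where $m$ accumulates negative gradients. As an assumption, the bias correction is applied to the raw running averages, which is the form spelled out in Prob 10.7 and in Kingma and Ba (2014). The function name `adam_minimize`, the learning rate 0.1, the step count and the test objective $J(w) = w^2 + 2w$ are illustrative choices only.

```python
import math

def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam with the sign convention of (3.97): m accumulates -grad."""
    w, m, s = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m - (1.0 - beta1) * g        # running average of negative gradients
        s = beta2 * s + (1.0 - beta2) * g * g    # running average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction, matters for small t
        s_hat = s / (1.0 - beta2 ** t)
        w = w + lr * m_hat / (math.sqrt(s_hat) + eps)
    return w

# Illustrative objective J(w) = w^2 + 2w, whose minimizer is w = -1.
w_final = adam_minimize(lambda w: 2.0 * w + 2.0, w0=2.0)
print(w_final)
```

In the first iteration the corrected ratio has magnitude close to 1, so the step size is roughly the learning rate regardless of how large the gradient is. This is the normalization effect of $s^{(t)}$ described above.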


TensorFlow: Loading MNIST data We will learn how to implement the simple
digit classifier using TensorFlow. The first step is to load the MNIST dataset, a
well-known dataset that is available in the package tensorflow.keras.datasets.
Moreover, the train and test datasets are already provided therein with a proper
split ratio, so we do not need to worry about how to split them. The only script
that we should write is:

from tensorflow.keras.datasets import mnist

# Load the predefined train/test split, then scale pixel values to [0, 1]
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train / 255.
X_test = X_test / 255.


We divide the input (X_train or X_test) by its maximum value 255 for the purpose 
of normalization. This procedure is done as a part of data preprocessing. 


TensorFlow: A two-layer DNN In order to implement the simple DNN, illus- 
trated in Fig. 3.43, we rely upon two major packages: 


(i) tensorflow.keras.models; 


(ii) tensorflow.keras.layers.


The models package contains several functionalities regarding a neural network. 
One major module is Sequential which is a neural network entity and hence can 
be described as a linear stack of layers. The layers package includes many elements 
of a neural network. Examples include fully-connected dense layers and activation 
functions. These two allow us to readily construct a model illustrated in Fig. 3.43. 


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))    # 28-by-28 image -> vector of size 784
model.add(Dense(500, activation='relu'))    # hidden layer with 500 neurons
model.add(Dense(10, activation='softmax'))  # output layer, one neuron per class


Flatten is a layer that reshapes a higher-dimensional input, like a 2D matrix, into
a vector. In this example, a digit image of size 28-by-28 is flattened into a vector
of size 784 (= 28 × 28). add() is a method for attaching a layer of interest to the
end of the sequential model. Dense refers to a fully-connected layer. Its input size
is automatically determined by the layer that it is attached to. The only thing to
specify is the number of output neurons. In this example, 500 refers to the number
of hidden neurons. We can also set an activation function with another argument,
like activation='relu'. The output layer comes with 10 neurons
(coinciding with the number of classes) and softmax activation.
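
As a quick sanity check on the architecture, the number of trainable parameters can be counted by hand. The sketch below assumes each Dense layer has one weight per input-output pair plus one bias per output neuron (the Keras default); the helper name `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully-connected layer: weights plus biases."""
    return n_in * n_out + n_out

input_size = 28 * 28                      # Flatten: 28-by-28 image -> length-784 vector
hidden = dense_params(input_size, 500)    # hidden layer with 500 neurons
output = dense_params(500, 10)            # output layer with 10 neurons
total = hidden + output
print(input_size, hidden, output, total)  # prints: 784 392500 5010 397510
```

Under that assumption, calling model.summary() on the model above should report the same per-layer counts.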


TensorFlow: Training a model For training, we first need to set up the algorithm
(optimizer) to be employed. We use the Adam optimizer. As mentioned earlier,
Adam has three key hyperparameters: (i) the learning rate $\alpha$; (ii) $\beta_1$ (capturing
the weight of past gradients); and (iii) $\beta_2$ (indicating the weight of the square of
past gradients). The default choice reads: $(\alpha, \beta_1, \beta_2) = (0.001, 0.9, 0.999)$. These
values are used if nothing is specified.

Next, we specify a loss function. We employ the optimal loss function: cross
entropy loss. A performance metric that we will look at during training and testing
can also be specified. One common metric is accuracy. One can set all of these via
another method, compile.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])


The option optimizer='adam' sets the default choice of the learning rate and betas.
For a manual choice, we define:

import tensorflow

opt = tensorflow.keras.optimizers.Adam(
    learning_rate=0.01,
    beta_1=0.92,
    beta_2=0.992)

We then replace the above option with optimizer=opt. As for the loss option
in compile, we employ 'sparse_categorical_crossentropy', which indicates cross
entropy loss beyond the binary case. The "sparse" prefix means that the labels are
supplied as integer class indices (as in y_train) rather than as one-hot vectors.

Now we can train the model on MNIST data. During training, we employ a part
of the entire set of examples to compute a gradient of the loss function. The part
is called a batch. Two more terms are worth defining. One is the step, which refers
to a loss computation (and the accompanying weight update) spanning the examples
in a single batch. The other is the epoch, which refers to one pass over all the
examples. In our experiment, we use a batch size of 64 and 20 epochs.


model.fit(X_train,y_train,batch_size=64,epochs=20) 
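
The relation between batches, steps and epochs can be made concrete with a little arithmetic. The sketch below uses the 60,000 examples of the MNIST training set and assumes that the final, smaller batch of each epoch is kept rather than dropped; the variable names are illustrative.

```python
import math

num_examples = 60000   # size of the MNIST training set
batch_size = 64
epochs = 20

steps_per_epoch = math.ceil(num_examples / batch_size)   # the last batch may be smaller
total_steps = steps_per_epoch * epochs
last_batch = num_examples - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, total_steps, last_batch)          # prints: 938 18760 32
```

So each epoch consists of 938 steps, and training for 20 epochs performs 18,760 weight updates in total.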


TensorFlow: Testing the trained model For testing, we need to make a
prediction from the model output. To this end, we use the predict() function as follows:

model.predict(X_test).argmax(1)

Here argmax(1) returns, for each test example, the class with the highest softmax
output among the 10 classes. In order to evaluate the test accuracy, we use the
evaluate() function:


model.evaluate(X_test, y_test) 
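
What argmax(1) and the accuracy metric compute can be illustrated without TensorFlow. The following plain-Python sketch uses made-up softmax outputs for three examples and three classes; the helper names `argmax_rows` and `accuracy` are hypothetical.

```python
def argmax_rows(probs):
    """Return, for each row, the index of its largest entry (the predicted class)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in probs]

def accuracy(predictions, labels):
    """Fraction of predictions that match the integer labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical softmax outputs: one row per example, one column per class.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]
labels = [0, 2, 0]

preds = argmax_rows(probs)
print(preds, accuracy(preds, labels))
```

Here the third example is misclassified (predicted class 1, true class 0), so two of the three predictions are correct.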


Look ahead We have concluded the supervised learning part. There are additional
topics that may pique your interest, but since we wish to cover other subjects, we
end our discussion of supervised learning here. If you are interested in delving
deeper into this topic, we recommend taking one of the many useful online deep
learning courses, such as those offered by Coursera. Moving forward, we will shift
our focus to unsupervised learning and explore one of the most popular machine
learning frameworks, Generative Adversarial Networks (GANs). In the upcoming
sections, we will delve into the intimate connections between KL divergence,
mutual information, and GANs.
Problem Set 10


Prob 10.1 (Basic concepts on machine learning) 


(a) State the definition of an algorithm. 

(b) State the definition of machine learning. 

(c) State the definition of artificial intelligence. 

(d) State the definition of examples (the terminology in machine learning). 
(e) State the definition of supervised learning. 


Prob 10.2 (KL divergence and cross entropy) The KL divergence is a valu- 
able metric that measures the dissimilarity between two distributions. As demon- 
strated in Prob 1.13, it can also represent mutual information /(X; Y) by consid- 
ering two distributions as a joint distribution P(X, Y) and a product distribution 
P(X)P(Y). Furthermore, as revealed in Prob 4.4, it plays a crucial role in charac- 
terizing the exponential decay rate of probability, which is essential in the statistical 
field known as Large Deviation Theory. 

Additionally, in this problem, we will utilize it to represent cross entropy, a well- 
known concept in machine learning commonly used in classification problems, 
which is defined as follows: 

$$ H(p, q) := -\sum_{x \in \mathcal{X}} p(x) \log q(x) = \mathbb{E}_p\left[\log \frac{1}{q(X)}\right]. \tag{3.100} $$

Show that

$$ H(p, q) = H(p) + \mathrm{KL}(p \| q) $$


where H(p) indicates the entropy of a random variable having the distribution of p. 
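
Before attempting the proof, the identity can be sanity-checked numerically. The sketch below is only a check on one pair of distributions, not a proof; the two distributions and the choice of base-2 logarithms are arbitrary assumptions.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2 (p(x) / q(x))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # arbitrary example distributions on three symbols
q = [0.2, 0.5, 0.3]
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)
```

The two printed values agree up to floating-point error, and the cross entropy is no smaller than the entropy of p.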


Prob 10.3 (Softmax activation for multi-class classifiers) This problem
explores a general classifier setting in which the number of classes is not limited to
2, say $c \in \mathbb{N}$. Let $z := [z_1, z_2, \dots, z_c]^T \in \mathbb{R}^c$ be the output of a multi-perceptron
prior to activation:

$$ z_j = w_j^T x \tag{3.101} $$

where $x \in \mathbb{R}^n$ indicates the input and $w_j := [w_{j1}, \dots, w_{jn}]^T \in \mathbb{R}^n$ denotes the
weight vector associated with the $j$th neuron in the output.

In an attempt to interpret those real values $z_j$'s as probability quantities that lie
between 0 and 1, people usually employ the following activation function, called
softmax:

$$ \hat{y}_j := [\mathrm{softmax}(z)]_j = \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}}, \quad j \in \{1, 2, \dots, c\}. \tag{3.102} $$

This is a natural extension of the logistic function: for $c = 2$,

$$ \hat{y}_1 := [\mathrm{softmax}(z)]_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2) \tag{3.103} $$

where $\sigma(\cdot)$ is the logistic function. Viewing $z_1 - z_2$ as the pre-activation
output of a binary classifier, $\hat{y}_1$ coincides with the output of logistic regression.

Let $y \in \{[1, 0, \dots, 0]^T, [0, 1, 0, \dots, 0]^T, \dots, [0, \dots, 0, 1]^T\}$ be a label of
one-hot-encoded-vector type. Here $\hat{y}_i$ can be interpreted as the probability that the
example is classified into class $i$. Hence, we assume that

$$ \hat{y}_i = P(y = [0, \dots, 1, \dots, 0]^T \mid x), \quad i \in \{1, \dots, c\} \tag{3.104} $$

where the 1 is in the $i$th position. We also assume that examples
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ are independent over $i$.


(a) Derive the likelihood of training examples:

$$ P(y^{(1)}, \dots, y^{(m)} \mid x^{(1)}, \dots, x^{(m)}). \tag{3.105} $$

Express it in terms of $y^{(i)}$'s and $\hat{y}^{(i)}$'s.
(b) Derive the optimal loss function that maximizes the likelihood (3.105).
(c) What is the name of the optimal loss function derived in part (b)?


Prob 10.4 (Cross entropy) Let $p$ and $q$ be two distributions. In information
theory, there is an important notion, called cross entropy (Cover and Joy, 2006):

$$ H(p, q) := -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x) = \mathbb{E}_p\left[\log_2 \frac{1}{q(X)}\right] \tag{3.106} $$

where $X \in \mathcal{X}$ is a discrete random variable. Show that

$$ H(p, q) \ge H(p) \tag{3.107} $$

where $H(p)$ denotes the entropy of a random variable with distribution $p$:

$$ H(p) := -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) = \mathbb{E}_p\left[\log_2 \frac{1}{p(X)}\right]. \tag{3.108} $$

Also identify conditions under which the equality in (3.107) holds.
Hint: Think about Jensen's inequality in Prob 1.5.


Prob 10.5 (2nd-order condition of convexity) Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is
twice differentiable, i.e., its second derivative $\nabla^2 f$ (also called the Hessian) exists at
each point in $\mathrm{dom} f$. A well-known fact w.r.t. convexity is: $f$ is convex if and only if

$$ \begin{cases} \mathrm{dom} f \text{ is convex;} \\ \nabla^2 f(x) \text{ is positive semi-definite, i.e., } \nabla^2 f(x) \succeq 0 \;\; \forall x \in \mathrm{dom} f \end{cases} \tag{3.109} $$

where $\mathrm{dom} f$ denotes the domain of the function $f$. This problem explores the
proof of this via the following subproblems.


(a) State the definition of a positive semi-definite matrix. 

(b) Suppose d = 1. Show that if f(x) is convex, then (3.109) holds. 
(c) Suppose d = 1. Show that if (3.109) holds, f(x) is convex. 

(d) Prove the 2nd-order condition for arbitrary d. 


Prob 10.6 (Gradient descent) Consider a function $J(w) = w^2 + 2w$ where
$w \in \mathbb{R}$. Consider gradient descent with the learning rate α = x and $w^{(0)} = 2$.

(a) Describe how gradient descent works.
(b) Using Python, run gradient descent to plot $w^{(t)}$ as a function of $t$.


Prob 10.7 (Optimizers) Consider gradient descent:

$$ w^{(t+1)} = w^{(t)} - \alpha \nabla J(w^{(t)}) $$

where $w^{(t)}$ indicates the weights of a model of interest at the $t$-th iteration; $J(w^{(t)})$
denotes the cost function evaluated at $w^{(t)}$; and $\alpha$ is the learning rate. Note that
only the current gradient, reflected in $\nabla J(w^{(t)})$, affects the weight update.

(a) (Momentum optimizer (Polyak, 1964)) In the literature, there is a prominent
variant of gradient descent that takes into account past gradients as well. Using
the past information, one can damp an oscillating effect in the weight update
that may incur instability in training. To capture past gradients and therefore
address the oscillation problem, another quantity, denoted by $m^{(t)}$, is introduced:

$$ m^{(t)} = \beta m^{(t-1)} + (1 - \beta)(-\nabla J(w^{(t)})) \tag{3.110} $$

where $\beta$ denotes another hyperparameter that captures the weight of the
past gradients, called the momentum. Here $m^{(t)}$ stands for the momentum
vector. The variant of the algorithm (called the momentum optimizer) takes
the following update for $w^{(t+1)}$:

$$ w^{(t+1)} = w^{(t)} + \alpha m^{(t)}. \tag{3.111} $$

Show that

$$ w^{(t+1)} = w^{(t)} - \alpha (1 - \beta) \sum_{k=0}^{t-1} \beta^{k} \nabla J(w^{(t-k)}) + \alpha \beta^{t} m^{(0)}. $$

(b) (Bias correction) Assuming that $\nabla J(w^{(t)})$ is the same for all $t$ and $m^{(0)} = 0$,
show that

$$ w^{(t+1)} = w^{(t)} - \alpha (1 - \beta^{t}) \nabla J(w^{(t)}). $$

Note: For a large value of $t$, $1 - \beta^t \approx 1$, so it has almost the same scaling
as that in the regular gradient descent. On the other hand, for a small value
of $t$, $1 - \beta^t$ can be small, being far from 1. For instance, when $\beta = 0.9$
and $t = 2$, $1 - \beta^t = 0.19$. This motivates us to rescale the moment $m^{(t)}$
in (3.110) through division by $1 - \beta^t$. Hence, in practice, we use:

$$ \hat{m}^{(t)} = \frac{m^{(t)}}{1 - \beta^{t}} \tag{3.112} $$

$$ w^{(t+1)} = w^{(t)} + \alpha \hat{m}^{(t)}. \tag{3.113} $$

This technique is called the bias correction.

(c) (Adam optimizer (Kingma and Ba, 2014)) Notice in (3.110) that a very large
or very small value of $\nabla J(w^{(t)})$ affects the weight update in quite a different
scaling. In an effort to avoid such a different scaling problem, people in
practice do normalization in the weight update (3.113) via a normalization
factor, denoted by $\hat{s}^{(t)}$ (Hinton et al., 2012):

$$ w^{(t+1)} = w^{(t)} + \alpha \frac{\hat{m}^{(t)}}{\sqrt{\hat{s}^{(t)}} + \epsilon} \tag{3.114} $$

where the division is component-wise, and

$$ \hat{m}^{(t)} = \frac{m^{(t)}}{1 - \beta_1^{t}}, \tag{3.115} $$

$$ m^{(t)} = \beta_1 m^{(t-1)} - (1 - \beta_1) \nabla J(w^{(t)}), \tag{3.116} $$

$$ \hat{s}^{(t)} = \frac{s^{(t)}}{1 - \beta_2^{t}}, \tag{3.117} $$

$$ s^{(t)} = \beta_2 s^{(t-1)} + (1 - \beta_2) (\nabla J(w^{(t)}))^2. \tag{3.118} $$

Here $(\cdot)^2$ indicates a component-wise square; $\epsilon$ is a tiny value introduced
to avoid division by 0 in practice (usually $10^{-8}$); and $s$ stands for square.
This optimizer (3.114) is called the Adam optimizer. Explain the rationale
behind the division by $1 - \beta_2^t$ in (3.117).


Prob 10.8 (TensorFlow implementation of a digit classifier) Consider a 
handwritten digit classifier in Section 3.14. In this problem, you are asked to build 
a classifier using a two-layer neural network with ReLU activation at the hidden 
layer and softmax activation at the output layer. 


(a) (MNIST dataset) Using the following script (or otherwise), load the MNIST
dataset:

from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train / 255.
X_test = X_test / 255.

What are $m$ (the number of training examples) and $m_{\mathrm{test}}$? What are the
shapes of X_train and y_train?


(b) (Data visualization) After executing the code in part (a), report the
output of the following:

import matplotlib.pyplot as plt
num_of_images = 60
for index in range(1, num_of_images + 1):
    plt.subplot(6, 10, index)
    plt.axis('off')
    plt.imshow(X_train[index], cmap='gray_r')


(c) (Model) Using the skeleton code provided in Section 3.14, write a script for
a two-layer neural network model with 500 hidden units fed by MNIST
data.

(d) (Training) Using the skeleton code in Section 3.14, write a script for
training the model generated in part (c) with cross entropy loss. Use the Adam
optimizer with:

learning rate = 0.001; $(\beta_1, \beta_2) = (0.9, 0.999)$

and 10 epochs. Also plot the training loss as a function of epochs.

(e) (Testing) Using the skeleton code in Section 3.14, write a script for testing the
model (trained in part (d)). What is the test accuracy?


Prob 10.9 (True or False?) 


(a) Consider an optimization problem for supervised learning:

$$ \min_{w} \sum_{i=1}^{m} \ell(y^{(i)}, f_w(x^{(i)})) \tag{3.119} $$

where $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ indicate input-output example pairs, and

$$ f_w(x) = \frac{1}{1 + e^{-w^T x}}. \tag{3.120} $$

The optimal loss function (in the sense of maximizing the likelihood) is:

$$ \ell(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.121} $$


(b) For two arbitrary distributions, say $p$ and $q$, consider cross entropy $H(p, q)$.
Then,

$$ H(p, q) \ge H(q) \tag{3.122} $$

where $H(q)$ is the entropy w.r.t. $q$.
(c) For two arbitrary distributions, say $p$ and $q$, consider cross entropy:

$$ H(p, q) := -\sum_{x \in \mathcal{X}} p(x) \log q(x) = \mathbb{E}_p\left[\log \frac{1}{q(X)}\right] \tag{3.123} $$

where $X \in \mathcal{X}$ is a discrete random variable. Then,

$$ H(p, q) = H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{3.124} $$

only when $q = p$.
(d) Consider a binary classifier where we are given input-output example pairs
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. Let $0 \le \hat{y}^{(i)} \le 1$ be the classifier output for the $i$th example.
Let $w$ be parameters of the classifier. Define:

$$ w^{*}_{\mathrm{CE}} := \arg\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) $$

$$ w^{*}_{\mathrm{KL}} := \arg\min_{w} \frac{1}{m} \sum_{i=1}^{m} \mathrm{KL}(y^{(i)} \| \hat{y}^{(i)}) $$

where $\ell_{\mathrm{CE}}(\cdot, \cdot)$ denotes cross entropy loss and $\mathrm{KL}(y^{(i)} \| \hat{y}^{(i)})$ indicates the KL
divergence between two binary random variables with parameters $y^{(i)}$ and
$\hat{y}^{(i)}$, respectively. Then,

$$ w^{*}_{\mathrm{CE}} = w^{*}_{\mathrm{KL}}. $$


(e) Suppose we execute the following code: 


import numpy as np

a = np.random.randn(4, 3, 3)
b = np.ones_like(a)
print(b[0].shape)
print(b.shape[0])


Then, the two prints yield the same results. 
(f) Suppose that image is an MNIST image of numpy array type. Then, one 
can use the following commands to plot the image: 


import matplotlib.pyplot as plt 
plt.imshow(image.squeeze(), cmap='gray_r')
3.15 Unsupervised Learning: Generative Modeling


Recap Over the past three sections, we have covered basic contents related to 
supervised learning. The primary objective of supervised learning is to estimate 
a function f(-) of a computer system from input-output samples, as depicted in 
Fig. 3.45. To transform a function optimization problem, a natural form of super- 
vised learning, into a parameterized optimization problem, we represented the func- 
tion with weights (or parameters) based on a particular system architecture, namely, 
the Perceptron. By using the logistic function, we obtained logistic regression and 
proved that cross entropy is the optimal loss function for maximizing likelihood. 
We also examined the more expressive Deep Neural Network (DNN) architecture 
for f(-). As for a choice of activation functions, we used ReLU activation func- 
tions for all hidden neurons and logistic (or softmax) function for the output layer. 
To solve optimization, we investigated the widely used gradient descent algorithm 
and its more advanced version, the Adam optimizer, which utilizes past gradients 
for stable training. Lastly, we learned how to implement the algorithm through 
TensorFlow. 


Unsupervised learning What comes next? In reality, there is a significant
challenge that arises in supervised learning. In many practical scenarios, collecting
labeled data is not a straightforward task. Typically, obtaining labeled data is a costly
process that requires extensive human labor for annotations. Therefore, there is a
growing need to explore solutions that do not rely on labeled data. Then, what can
we do only with $\{x^{(i)}\}_{i=1}^{m}$?

This is where unsupervised learning becomes relevant. Unsupervised learning
is a methodology for acquiring knowledge about data $\{x^{(i)}\}_{i=1}^{m}$ without relying
on labeled data. One of the prominent targets for unsupervised learning is the
probability distribution, which is the most complex yet fundamental information.
The probability distribution enables us to generate realistic signals according to our
preferences. The generative modeling method is the unsupervised learning
technique used to learn this fundamental entity and is widely recognized in the field.
Therefore, we will concentrate on this approach.

Figure 3.45. Supervised learning: Learning the function $f(\cdot)$ of a system of interest from
data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$.

Figure 3.46. Ian Goodfellow, a young figure in the modern AI field. He is best known as
the inventor of the Generative Adversarial Networks (GANs), which made a big wave in
AI history.


Generative Adversarial Networks (GANs) Our focus will be on Genera- 
tive Adversarial Networks (GANs) (Goodfellow et al., 2014), a popular generative 
model in the literature. Ian Goodfellow, a research scientist in the AI field, invented 
GANs; see Fig. 3.46. GANs have proven to be instrumental in various applica- 
tions, including image creation, human image synthesis, image inpainting, color- 
ing, super-resolution image synthesis, speech synthesis, style transfer, and robot 
navigation. GANs are so effective that, as of October 3, 2019, the state of Califor- 
nia passed a bill banning the use of GANs to create fake pornography without the 
consent of those depicted. From an information theory perspective, GANs are an 
intriguing framework since they are closely linked to KL divergence and mutual 
information. 


Outline In the upcoming sections, we will delve deep into the connection between
GANs, KL divergence, and mutual information. Our investigation will unfold in 
four parts. Firstly, we will explore the concept of generative modeling. Secondly, we 
will formulate an optimization problem, with a particular emphasis on the GAN 
framework. Then, we will establish a connection to the KL divergence and mutual 
information. Finally, we will learn how to solve the GAN optimization problem 
and implement it using TensorFlow. For this section, our focus will be on the first 
two parts. 


Generative modeling Generative modeling refers to a method of producing
synthetic data that follows a distribution similar to that of real data. The model
parameters are trained using real data so that the model generates fake data that
resembles the real data. Fig. 3.47 provides a visual illustration of this process. The
input signal, which is not shown in the figure but should be fed into the model,
can be either a randomly generated signal or a synthesized signal that serves as a
seed for the fake data. The choice of input signal depends on the specific
application. We will delve into this in further detail later on.

Figure 3.47. A generative model is one that generates fake data which resembles
real data. Here, what resembling means in mathematical language is that it has a
similar distribution.


A remark on generative modeling The design of a generative model has been 
a classical age-old problem, and it is considered one of the most important problems 
in statistics. This is because the main objective of the statistics field is to determine 
the probability distribution of data, and the generative model serves as the under- 
lying framework. Furthermore, the model can be utilized as a concrete function 
block, also known as the generator in the field, to generate realistic fake data. Den- 
sity estimation is another common name for the problem in statistics, where the 
density pertains to the probability distribution. 


Notations We relate generative modeling to optimization. We feed an input
signal that one can arbitrarily synthesize. A common way to generate the input is to
use a Gaussian or uniform distribution. For this input, we employ a conventional
"x" notation, say $x \in \mathbb{R}^k$, where $k$ is a dimension. To avoid the conflict in notation
with real data $\{x^{(i)}\}_{i=1}^{m}$, we use a different notation, say $\{y^{(i)}\}_{i=1}^{m}$, for real data.
Please do not confuse these with labels. In fact, the convention in machine learning
is to use a $z$ notation for a fake input while maintaining $\{x^{(i)}\}_{i=1}^{m}$ for real data. This
may be the convention that you should follow when writing papers. Let $\hat{y} \in \mathbb{R}^n$ be
a fake output. Let $\{(x^{(i)}, \hat{y}^{(i)})\}_{i=1}^{m}$ be such $m$ fake input-output pairs and let
$\{y^{(i)}\}_{i=1}^{m}$ be real data examples. See Fig. 3.48.


Goal Let $G(\cdot)$ be a function of the generator. Then, the goal of the generative
model can be stated as follows: Designing $G(\cdot)$ such that

$$ \{G(x^{(i)})\}_{i=1}^{m} \approx \{y^{(i)}\}_{i=1}^{m} \quad \text{in distribution.} $$

Figure 3.48. Problem formulation for generative modeling.


What does “in distribution” mean? To make it clear, we need to quantify
closeness between two distributions. One natural yet prominent approach employed in
statistics is to take the following two steps:

1. Compute empirical distributions or estimate distributions from $\{y^{(i)}\}_{i=1}^{m}$ and
$\{\hat{y}^{(i)}\}_{i=1}^{m}$. Let such distributions be:

$$ Q_Y, \; Q_{\hat{Y}} $$

for real and fake data, respectively.

2. Next, employ a well-known divergence measure in statistics which can serve
to quantify closeness of two distributions. Let $D(\cdot, \cdot)$ be one such divergence
measure. Then, the similarity between $Q_Y$ and $Q_{\hat{Y}}$ can be quantified as:

$$ D(Q_Y, Q_{\hat{Y}}). $$

Taking the above approach, one can state the goal as: Designing $G(\cdot)$ such that
$D(Q_Y, Q_{\hat{Y}})$ is minimized.
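
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are small made-up discrete samples, and the KL divergence is picked as one possible choice of $D(\cdot, \cdot)$ (the text leaves this choice open). The helper names are hypothetical.

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Step 1: map each distinct value to its relative frequency in the samples."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}

def kl_divergence(p, q):
    """Step 2 (one possible D): KL(p || q) in bits; infinite if q misses mass of p."""
    total = 0.0
    for value, pv in p.items():
        qv = q.get(value, 0.0)
        if qv == 0.0:
            return math.inf
        total += pv * math.log2(pv / qv)
    return total

real = [0, 0, 1, 1, 1, 2, 2, 2]          # hypothetical real data y^(i)
fake_good = [0, 1, 1, 1, 2, 2, 2, 0]     # same histogram as the real data
fake_poor = [0, 0, 0, 0, 0, 0, 1, 2]     # mass concentrated on a single value

q_real = empirical_distribution(real)
d_good = kl_divergence(q_real, empirical_distribution(fake_good))
d_poor = kl_divergence(q_real, empirical_distribution(fake_poor))
print(d_good, d_poor)
```

A generator whose outputs reproduce the histogram of the real data attains divergence 0, while a generator that concentrates on one value is penalized with a strictly positive divergence.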


Optimization under the approach Under the approach, one can formulate
an optimization problem as:

$$ \min_{G(\cdot)} D(Q_Y, Q_{\hat{Y}}). \tag{3.125} $$

As you may recognize, a couple of issues arise in solving the above problem (3.125).
One issue is that it is a function optimization problem. As mentioned earlier, one
common way to resolve this is to parameterize the function with a neural network:

$$ \min_{G(\cdot) \in \mathcal{N}} D(Q_Y, Q_{\hat{Y}}) \tag{3.126} $$

where $\mathcal{N}$ indicates a class of neural networks.

There are two more issues. First, the objective function $D(Q_Y, Q_{\hat{Y}})$ is a
complicated function of the knob $G(\cdot)$. Note that $Q_{\hat{Y}}$ is a function of $G(\cdot)$, as $\hat{y} = G(x)$.
The objective function is a twice folded composite function of $G(\cdot)$. The second is
perhaps the most fundamental issue. It is not clear as to how to choose a divergence
measure $D(\cdot, \cdot)$.


Look ahead There are various methods to tackle the aforementioned concerns, 
and one of them leads to formulating an optimization problem for GANs. The 
following section will explore this method for deriving the optimization problem 
for GANs. 
3.16 Generative Adversarial Networks (GANs) and KL Divergence


Recap In the previous section, we introduced unsupervised learning. The goal
of unsupervised learning is to learn something about data, which we denoted by
$\{y^{(i)}\}_{i=1}^{m}$ instead of $\{x^{(i)}\}_{i=1}^{m}$. There are several unsupervised learning methods
available depending on the desired outcome. However, we have emphasized one
particular approach, which is generative modeling, aimed at learning the
probability distribution. We have also formulated an optimization problem for generative
modeling:

$$ \min_{G(\cdot) \in \mathcal{N}} D(Q_Y, Q_{\hat{Y}}) \tag{3.127} $$

where $Q_Y$ and $Q_{\hat{Y}}$ indicate the empirical distributions for real and fake data,
respectively; $G(\cdot)$ denotes the function of the generator; $D(\cdot, \cdot)$ is a divergence measure;
and $\mathcal{N}$ is a class of neural networks. Next, we brought up a few challenges in
the optimization problem: (i) the optimization is a function optimization; (ii) the
objective function involves complex dependencies on $G(\cdot)$; and (iii) the choice of
$D(\cdot, \cdot)$ is not straightforward.

We also pointed out that there are approaches to tackle these issues, one of which
leads to an optimization problem for a highly effective generative model called
Generative Adversarial Networks (GANs).


Outline This section will delve into the details on GANs. The section consists of 
three parts. Firstly, we will examine the path that leads to GANs. Secondly, we will 
derive an optimization problem for GANs. Finally, we will demonstrate that GANs
have a close connection with the KL divergence and mutual information.


What is the way that leads to GANs? Remember one challenge that we are
faced with in the optimization problem (3.127): $D(Q_Y, Q_{\hat{Y}})$ is a complicated
function of $G(\cdot)$. To address this, we take an indirect way to represent $D(Q_Y, Q_{\hat{Y}})$. We
first observe how $D(Q_Y, Q_{\hat{Y}})$ should behave, and then, based on the observation,
we will come up with an indirect way to mimic the behaviour. It turns out that this
way leads us to explicitly compute $D(Q_Y, Q_{\hat{Y}})$. Below are details.

How should $D(Q_Y, Q_{\hat{Y}})$ behave? One observation that we can make is that if
one can easily discriminate real data $y$ from fake data $\hat{y}$, then the divergence must be
large; otherwise, it should be small. This motivates us to:

Interpret $D(Q_Y, Q_{\hat{Y}})$ as the ability to discriminate.

Figure 3.49. Discriminator wishes to output $D(\cdot)$ such that $D(y)$ is as large as possible
while $D(\hat{y})$ is as small as possible.


We introduce an entity that can serve the discriminating function. This particular
entity was introduced by Ian Goodfellow, the inventor of GANs, and he named it:

Discriminator.

Goodfellow considered a binary-output discriminator which takes as an input
either real data $y$ or fake data $\hat{y}$. He then wanted to design $D(\cdot)$ such that $D(\cdot)$
well approximates the probability that the input $(\cdot)$ is real data:

$$ D(\cdot) \approx P((\cdot) = \text{real data}). $$

Noticing that

$$ P(y = \text{real}) = 1; \quad P(\hat{y} = \text{real}) = 0, $$

he wanted to design $D(\cdot)$ such that:

$D(y)$ is as large as possible, close to 1;
$D(\hat{y})$ is as small as possible, close to 0.

See Fig. 3.49.


How to quantify the ability to discriminate? Keeping the picture in Fig. 3.49
in mind, he wanted to quantify the ability to discriminate. To this end, he
observed that if $D(\cdot)$ can easily discriminate, then we should have:

$$ D(y) \uparrow; \quad 1 - D(\hat{y}) \uparrow. $$

Although one simplistic approach to capturing the ability is to add the above two
terms, Goodfellow chose to use a logarithmic summation instead:

$$ \log D(y) + \log(1 - D(\hat{y})). \tag{3.128} $$

Figure 3.50. A two-player game for GAN: Discriminator $D(\cdot)$ wishes to maximize the
quantified ability (3.129), while another player, generator $G(\cdot)$, wants to minimize (3.129).

During the NeurIPS 2016 conference, Goodfellow presented a tutorial on GANs
and referenced a paper from AISTATS 2010 as the source of inspiration for the
problem formulation (Gutmann and Hyvärinen, 2010). See Eq. (3) in the paper.

Making the particular choice, he quantified the ability to discriminate for $m$
examples as:

$$ \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})). \tag{3.129} $$


A two-player game Goodfellow then introduced a two-player game in which 
player 1, discriminator D(-), wishes to maximize the quantified ability (3.129), 
while player 2, generator G(-), wants to minimize (3.129). See Fig. 3.50 for illus- 
tration. 
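
The quantity (3.129) is easy to evaluate once the discriminator outputs are known. The following sketch uses made-up discriminator outputs in (0, 1); the function name `discriminator_score` is hypothetical. It compares a discriminator that separates real from fake data well with one that outputs 0.5 everywhere.

```python
import math

def discriminator_score(d_real, d_fake):
    """Average of log D(y) + log(1 - D(y_hat)) over m examples, as in (3.129)."""
    m = len(d_real)
    return sum(math.log(dr) + math.log(1.0 - df) for dr, df in zip(d_real, d_fake)) / m

# Hypothetical discriminator outputs for m = 3 real and 3 fake examples.
confident = discriminator_score([0.9, 0.95, 0.85], [0.1, 0.05, 0.2])
confused = discriminator_score([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(confident, confused)
```

The confident discriminator attains a score close to 0, the largest possible value, whereas the confused one scores $-2 \log 2$. In the two-player game, the discriminator pushes this score up, while the generator tries to produce fake data that pulls it down.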


Optimization for GANs The two-player game motivated him to formulate the
following min max optimization problem:

$$ \min_{G(\cdot)} \max_{D(\cdot)} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})). \tag{3.130} $$

You may be wondering why the order of “max min” was not used instead (i.e., first
taking “min” and then “max”). While that is a possible approach, there is a specific
reason why “min max” is preferred, which will become clear shortly. Note that the
optimization is focused on two functions, $D(\cdot)$ and $G(\cdot)$, meaning that it is still a
function optimization. Fortunately, the GAN paper was published in 2014, after
the start of the deep learning revolution. This enabled Goodfellow to appreciate
the power of neural networks: “Deep neural networks can represent any arbitrary
function well.” This inspired him to use neural networks to parameterize the two
functions, resulting in the following optimization problem:

$$ \min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})) \tag{3.131} $$

where $\mathcal{N}$ denotes a class of neural networks. This is the optimization problem for
GANs.


Related to original optimization? Remember what we mentioned earlier. The 
way leading to the GAN optimization is an indirect way of solving the original 
optimization problem: 


$$\min_{G(\cdot)} D(Q_Y, Q_{\tilde{Y}}). \tag{3.132}$$


What is the relationship between the two problems, (3.131) and (3.132)? These 
problems are closely linked, and this is precisely where the selection of “min 
max” (rather than “max min”) becomes important. The alternative approach can- 
not establish a connection. Research has demonstrated that, assuming deep neu- 
ral networks can effectively represent any function, the GAN optimization prob- 
lem (3.131) can be transformed into the original optimization form (3.132). Below 
we will demonstrate this. 


Simplification & manipulation Let us start by simplifying the GAN optimization 
(3.131). Since we assume that $\mathcal{N}$ can represent any arbitrary function, the 
problem (3.131) becomes unconstrained: 


$$\min_{G(\cdot)} \max_{D(\cdot)} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\tilde{y}^{(i)})). \tag{3.133}$$


The objective is a function of $D(\cdot)$, and $D(\cdot)$ appears twice but with 
different arguments: one is $y^{(i)}$; the other is $\tilde{y}^{(i)}$. So in the current form (3.133), the 
inner (max) optimization is not quite tractable to solve. In an attempt to make it 
tractable, let us express it in a different manner using the following notations. 

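Define a random vector $Y$ which takes one of the $m$ real examples with 
probability $\frac{1}{m}$ (uniform distribution): 


$$Y \in \{y^{(1)}, \ldots, y^{(m)}\} =: \mathcal{Y}; \qquad Q_Y(y^{(i)}) = \frac{1}{m}, \quad i \in \{1, 2, \ldots, m\}$$


where $Q_Y$ indicates the probability distribution of $Y$. Similarly define $\tilde{Y}$ for fake 
examples: 


$$\tilde{Y} \in \{\tilde{y}^{(1)}, \ldots, \tilde{y}^{(m)}\} =: \tilde{\mathcal{Y}}; \qquad Q_{\tilde{Y}}(\tilde{y}^{(i)}) = \frac{1}{m}, \quad i \in \{1, 2, \ldots, m\}$$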
where $Q_{\tilde{Y}}$ indicates the probability distribution of $\tilde{Y}$. Using these notations, one 
can rewrite the problem (3.133) as: 


$$\min_{G(\cdot)} \max_{D(\cdot)} \sum_{i=1}^{m} Q_Y(y^{(i)}) \log D(y^{(i)}) + Q_{\tilde{Y}}(\tilde{y}^{(i)}) \log(1 - D(\tilde{y}^{(i)})). \tag{3.134}$$


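Still we have different arguments in the two $D(\cdot)$ functions. To address this, we 
introduce another notation. Let $z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}$. Newly define $Q_Y(\cdot)$ and $Q_{\tilde{Y}}(\cdot)$ such 
that: 


$$Q_Y(z) := 0 \quad \text{if } z \in \tilde{\mathcal{Y}} \setminus \mathcal{Y}; \tag{3.135}$$
$$Q_{\tilde{Y}}(z) := 0 \quad \text{if } z \in \mathcal{Y} \setminus \tilde{\mathcal{Y}}. \tag{3.136}$$


Using the $z$ notation, one can rewrite the problem (3.134) as: 


$$\min_{G(\cdot)} \max_{D(\cdot)} \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \log D(z) + Q_{\tilde{Y}}(z) \log(1 - D(z)). \tag{3.137}$$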


We see that the same arguments appear in the two $D(\cdot)$ functions. 


Solving the inner optimization We are ready to solve the inner optimization 
in (3.137). Key observations are: $\log D(z)$ is concave in $D(\cdot)$; $\log(1 - D(z))$ is 
concave in $D(\cdot)$; and therefore, the objective function is concave in $D(\cdot)$. This 
implies that the objective has a unique maximum in the function space of $D(\cdot)$. 
Hence, one can find the maximum by searching for the stationary point. Taking a 
derivative w.r.t. $D(z)$ and setting it to zero, we get: 


$$\frac{Q_Y(z)}{D^*(z)} - \frac{Q_{\tilde{Y}}(z)}{1 - D^*(z)} = 0 \quad \forall z.$$


This then yields: 

$$D^*(z) = \frac{Q_Y(z)}{Q_Y(z) + Q_{\tilde{Y}}(z)} \quad \forall z. \tag{3.138}$$

Plugging this into (3.137), we obtain: 

$$\min_{G(\cdot)} \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{Q_Y(z) + Q_{\tilde{Y}}(z)} + Q_{\tilde{Y}}(z) \log \frac{Q_{\tilde{Y}}(z)}{Q_Y(z) + Q_{\tilde{Y}}(z)}. \tag{3.139}$$
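The claim that the ratio $Q_Y(z) / (Q_Y(z) + Q_{\tilde{Y}}(z))$ maximizes the inner objective can be checked numerically. The sketch below is illustrative only: the two distributions are made-up values on a common three-point support, the helper name `inner_objective` is hypothetical, and the logarithm is assumed to be base 2.

```python
import math

def inner_objective(q_real, q_fake, d):
    """Sum over z of Q_Y(z) log2 D(z) + Q_Ytilde(z) log2(1 - D(z)), as in (3.137)."""
    total = 0.0
    for qr, qf, dz in zip(q_real, q_fake, d):
        if qr > 0:
            total += qr * math.log2(dz)
        if qf > 0:
            total += qf * math.log2(1.0 - dz)
    return total

# Hypothetical distributions of real and fake samples on three points.
q_real = [0.5, 0.3, 0.2]
q_fake = [0.2, 0.3, 0.5]

# Candidate maximizer: the ratio Q_Y / (Q_Y + Q_Ytilde) at every point.
d_star = [qr / (qr + qf) for qr, qf in zip(q_real, q_fake)]
best = inner_objective(q_real, q_fake, d_star)

# Nudge D* at one point at a time; the objective should only go down.
perturbed_values = []
for k in range(len(d_star)):
    for delta in (-0.05, 0.05):
        d = list(d_star)
        d[k] += delta
        perturbed_values.append(inner_objective(q_real, q_fake, d))

print(best, max(perturbed_values))
```

Every perturbed discriminator yields a strictly smaller objective, which is what concavity together with the stationarity condition predicts.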
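Connection to KL divergence We massage the objective function in (3.139) 
to express it as: 


$$\sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} + \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(z) \log \frac{Q_{\tilde{Y}}(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} - 2. \tag{3.140}$$


The constant $-2$ appears because halving each denominator must be compensated by 
$\log \frac{1}{2} = -1$ (the logarithm is base 2), once for each of the two sums. Each of 
the two sums above can be expressed with a well-known divergence measure: the KL 
divergence. Hence, we get: 


$$\min_{G(\cdot)} \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} + \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(z) \log \frac{Q_{\tilde{Y}}(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} - 2 \tag{3.141}$$

$$= \min_{G(\cdot)} \mathsf{KL}\big(Q_Y \,\|\, (Q_Y + Q_{\tilde{Y}})/2\big) + \mathsf{KL}\big(Q_{\tilde{Y}} \,\|\, (Q_Y + Q_{\tilde{Y}})/2\big) - 2. \tag{3.142}$$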


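Slightly manipulating the above, we obtain an equivalent optimization: 


$$G^*_{\mathsf{GAN}} = \arg\min_{G(\cdot)} \frac{1}{2} \Big\{ \mathsf{KL}\big(Q_Y \,\|\, (Q_Y + Q_{\tilde{Y}})/2\big) + \mathsf{KL}\big(Q_{\tilde{Y}} \,\|\, (Q_Y + Q_{\tilde{Y}})/2\big) \Big\}. \tag{3.143}$$


Note that the objective coincides with the Jensen-Shannon divergence that we 
introduced in Prob 1.13. 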
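The step from (3.139) to the Jensen-Shannon divergence can be sanity-checked numerically: the plugged-in objective should equal twice the Jensen-Shannon divergence minus 2. The following is a minimal sketch under assumptions: the distributions are made-up, the helper names `kl`, `js`, and `plugged_in_objective` are hypothetical, and logarithms are base 2.

```python
import math

def kl(p, q):
    """KL divergence (base 2) between distributions given as lists over a common support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL divergence to the equal mixture."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, mix) + kl(q, mix))

def plugged_in_objective(p, q):
    """Objective of (3.139): the inner objective evaluated at D*(z) = p(z) / (p(z) + q(z))."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log2(pi / (pi + qi))
        if qi > 0:
            total += qi * math.log2(qi / (pi + qi))
    return total

# Hypothetical real and fake distributions; the last point is reached only by fake samples.
q_real = [0.5, 0.3, 0.2, 0.0]
q_fake = [0.2, 0.3, 0.4, 0.1]

lhs = plugged_in_objective(q_real, q_fake)
rhs = 2 * js(q_real, q_fake) - 2
print(lhs, rhs)
```

Since the two quantities differ only by a positive scaling and a constant shift, minimizing one over the generator is the same as minimizing the other.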


Connection to mutual information From (3.143), we can also make a connection 
to mutual information. Recall what you were asked to prove in Prob 1.13. 
That is, for two random variables, say $T$ and $\bar{Y}$, 


$$I(T; \bar{Y}) = \sum_{t \in \mathcal{T}} P_T(t)\, \mathsf{KL}(P_{\bar{Y}|t} \,\|\, P_{\bar{Y}}) \tag{3.144}$$


where $P_T$ and $P_{\bar{Y}}$ denote the probability distributions of $T$ and $\bar{Y}$, respectively; 
and $P_{\bar{Y}|t}$ indicates the conditional distribution of $\bar{Y}$ given $T = t$. 


Suppose that $T \sim \mathsf{Bern}(\frac{1}{2})$ and we define $\bar{Y}$ as: 

$$\bar{Y} = \begin{cases} Y, & T = 1; \\ \tilde{Y}, & T = 0. \end{cases}$$


Then, we get: 

$$P_{\bar{Y}|1} = Q_Y, \qquad P_{\bar{Y}|0} = Q_{\tilde{Y}}, \qquad P_{\bar{Y}} = (Q_Y + Q_{\tilde{Y}})/2$$

where the last is due to the total probability law (why?). This together with (3.143) 
and (3.144) gives: 


$$G^*_{\mathsf{GAN}} = \arg\min_{G(\cdot)} I(T; \bar{Y}). \tag{3.145}$$
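The link between the mutual information and the Jensen-Shannon divergence can also be verified numerically. The sketch below computes the mutual information between a fair coin and the resulting mixture variable directly from their joint distribution and compares it with the Jensen-Shannon divergence. The distributions and function names are hypothetical, and logarithms are base 2.

```python
import math

def kl(p, q):
    """KL divergence (base 2) between two distributions on a common support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between p and q."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, mix) + kl(q, mix))

def mutual_information_coin_mixture(q_real, q_fake):
    """I(T; Ybar) for a fair coin T, where Ybar is real if T = 1 and fake if T = 0.

    Computed from the joint distribution P(t, z) = P(t) * P(z | t).
    """
    marginal = [(a + b) / 2 for a, b in zip(q_real, q_fake)]  # total probability law
    mi = 0.0
    for p_t, conditional in ((0.5, q_real), (0.5, q_fake)):
        for c, m_z in zip(conditional, marginal):
            if c > 0:
                joint = p_t * c
                mi += joint * math.log2(joint / (p_t * m_z))
    return mi

# Hypothetical real and fake distributions.
q_real = [0.5, 0.3, 0.2, 0.0]
q_fake = [0.2, 0.3, 0.4, 0.1]

mi = mutual_information_coin_mixture(q_real, q_fake)
print(mi, js(q_real, q_fake))
```

Intuitively, the mutual information measures how well one can tell from a sample whether it was real or fake, so a generator that drives it to zero produces samples that are indistinguishable from real ones.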


Look ahead We have developed an optimization problem for GANs and estab- 
lished an intriguing link to the KL divergence and mutual information. In the 
subsequent section, we will explore a method for solving the GAN optimization 
problem (3.131) and implement it utilizing TensorFlow. 
3.17  GANs: TensorFlow Implementation 


Recap In the prior section, we investigated Goodfellow’s approach to formulate 
an optimization problem for GANs. He began by quantifying the ability to 
discriminate real against fake samples: 


$$\frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\tilde{y}^{(i)})) \tag{3.146}$$


where $y^{(i)}$ and $\tilde{y}^{(i)} := G(x^{(i)})$ indicate real and fake samples, respectively; $D(\cdot)$ 
denotes the output of discriminator; and $m$ is the number of examples. He then 
introduced two players: (i) player 1, discriminator, who wishes to maximize the 
ability; (ii) player 2, generator, who wants to minimize it. This led to the 
optimization problem for GANs: 


$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(G(x^{(i)}))) \tag{3.147}$$


where $\mathcal{N}$ denotes a class of neural networks. Lastly we demonstrated that the 
problem (3.147) can be stated in terms of the KL divergence or mutual information, 
thus making a connection to information theory. 

Two natural questions arise. First, how do we solve the problem (3.147)? Second, 
how do we implement the solution in TensorFlow? 


Outline This section addresses these two questions in four parts. First, we will 
explore a practical approach to solving the problem (3.147). Next, 
we will conduct a case study to exercise the approach. Specifically, we will focus 
on generating MNIST-style handwritten digit images. We will then delve into one 
important implementation detail: Batch Normalization (Ioffe and Szegedy, 2015), 
which is known to be quite useful for deep neural networks. Finally, we will learn 
how to write a TensorFlow program that implements the approach. 


Parameterization Solving the problem (3.147) starts with parameterizing the 
two functions $G(\cdot)$ and $D(\cdot)$ with neural networks: 


$$\min_{w} \max_{\theta} \underbrace{\frac{1}{m} \sum_{i=1}^{m} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)})))}_{=: J(w, \theta)} \tag{3.148}$$


where $w$ and $\theta$ indicate parameters for $G(\cdot)$ and $D(\cdot)$, respectively. Is the 
parameterized problem (3.148) the one that we are familiar with? In other words, is $J(w, \theta)$ 
convex in $w$? Is $J(w, \theta)$ concave in $\theta$? Unfortunately, it is not the case. In general, 
the objective is highly non-convex in $w$ and highly non-concave in $\theta$. 

Then, what can we do? In fact, we cannot do much beyond what we already know: 
finding a stationary point via a method like gradient 
descent. One practical way is to simply look for a stationary point, say $(w^*, \theta^*)$, 
such that 


$$\nabla_w J(w^*, \theta^*) = 0, \qquad \nabla_\theta J(w^*, \theta^*) = 0,$$


while keeping our fingers crossed that such a point yields near-optimal performance. 
Luckily, this is often the case in practice, especially when employing neural networks for 
parameterization. Considerable effort has been made by many theorists to figure 
out why that is the case, e.g., (Arora et al., 2017). However, a clear theoretical 
understanding is still lacking despite their efforts. 


Alternating gradient descent One practical method that attempts to find such a 
stationary point in the min-max optimization (3.148), although without a guarantee 
of success, is alternating gradient descent. 

Here is how it works. At the $t$th iteration, update generator’s weight: 


$$w^{(t+1)} \leftarrow w^{(t)} - \alpha_1 \nabla_w J(w^{(t)}, \theta^{(t)})$$


where $w^{(t)}$ and $\theta^{(t)}$ denote the weights of generator and discriminator at the $t$th 
iteration, respectively; and $\alpha_1$ is the learning rate for generator. Given $(w^{(t+1)}, \theta^{(t)})$, 
we next update discriminator’s weight as per: 


$$\theta^{(t+1)} \leftarrow \theta^{(t)} + \alpha_2 \nabla_\theta J(w^{(t+1)}, \theta^{(t)})$$


where $\alpha_2$ is the learning rate for discriminator. In the discriminator update, we 
perform gradient ascent. Lastly we repeat the above two steps until convergence. 

In practice, we may wish to control the frequency of discriminator weight update 
relative to that of generator. To this end, we often employ $k:1$ alternating gradient 
descent: 


1. Update generator’s weight: 
   $w^{(t+1)} \leftarrow w^{(t)} - \alpha_1 \nabla_w J(w^{(t)}, \theta^{(t \cdot k)})$. 
2. Update discriminator’s weight $k$ times while fixing $w^{(t+1)}$: for $i = 1:k$, 
   $\theta^{(t \cdot k + i)} \leftarrow \theta^{(t \cdot k + i - 1)} + \alpha_2 \nabla_\theta J(w^{(t+1)}, \theta^{(t \cdot k + i - 1)})$. 
3. Repeat the above. 
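The three steps above can be sketched on a toy problem. The example below is only an illustration under assumptions: instead of the GAN objective it uses the hypothetical function $J(w, \theta) = w^2 + 2w\theta - \theta^2$, which is convex in $w$, concave in $\theta$, and has its saddle point at $(0, 0)$; the learning rates and iteration count are arbitrary choices.

```python
def grad_w(w, theta):
    """Partial derivative of J(w, theta) = w^2 + 2*w*theta - theta^2 with respect to w."""
    return 2 * w + 2 * theta

def grad_theta(w, theta):
    """Partial derivative of J with respect to theta."""
    return 2 * w - 2 * theta

def alternating_gd(w, theta, alpha1=0.1, alpha2=0.1, k=2, num_iters=200):
    """k:1 alternating gradient descent: one descent step on w, then k ascent steps on theta."""
    for _ in range(num_iters):
        # Step 1: gradient descent on the minimizing variable (generator role).
        w = w - alpha1 * grad_w(w, theta)
        # Step 2: k gradient ascent steps on the maximizing variable (discriminator role).
        for _ in range(k):
            theta = theta + alpha2 * grad_theta(w, theta)
    return w, theta

w_final, theta_final = alternating_gd(1.0, 1.0)
print(w_final, theta_final)
```

On this well-behaved toy function the iterates spiral into the saddle point. For the non-convex, non-concave GAN objective no such guarantee is available, which is why the procedure is only an attempt to reach a stationary point.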
You may wonder why we update discriminator more frequently than generator. 
Usually more updates in the inner optimization yield better performance in 
practice. Further, we employ the Adam optimizer together with batches. We leave 
details in Prob 11.6. 


A practical tip on generator Let us say a few words about generator optimization. 
Given discriminator’s parameter $\theta$, the generator wishes to minimize: 


$$\min_{w} \frac{1}{m_B} \sum_{i \in B} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)})))$$

where $B$ indicates a batch and $m_B$ is the batch size (the number of examples in the 
batch). Notice that $\log D_\theta(y^{(i)})$ in the above does not depend on generator’s weight $w$. 
Hence, it suffices to minimize: 


$$\min_{w} \underbrace{\frac{1}{m_B} \sum_{i \in B} \log(1 - D_\theta(G_w(x^{(i)})))}_{\text{generator loss}}$$


where the underbraced term is called “generator loss”. However, in practice, instead 
of minimizing the generator loss directly, people rely on the following proxy: 


$$\min_{w} \frac{1}{m_B} \sum_{i \in B} -\log D_\theta(G_w(x^{(i)})). \tag{3.149}$$


You may wonder why. There is a technical rationale behind the use of the proxy. 
Check this in Prob 11.2. 
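As a hint at that rationale, the sketch below compares how strongly each loss reacts to a change in the discriminator output on a fake sample. It is a simplified illustration, not the full argument of Prob 11.2: only the derivative with respect to the discriminator output is computed (the chain-rule factors shared by both losses are omitted), the function names are hypothetical, and the logarithm is assumed to be base 2.

```python
import math

def grad_original(d):
    """Derivative of log2(1 - d) with respect to the discriminator output d."""
    return -1.0 / (math.log(2) * (1.0 - d))

def grad_proxy(d):
    """Derivative of -log2(d) with respect to the discriminator output d."""
    return -1.0 / (math.log(2) * d)

# d is the discriminator output on a fake sample; early in training it is often close to 0.
for d in (0.5, 0.1, 0.01):
    print(d, abs(grad_original(d)), abs(grad_proxy(d)))
```

Both derivatives are negative, so both losses push the discriminator output on fake samples upward. When that output is close to 0, however, the proxy provides a much larger gradient, whereas the original generator loss is nearly flat.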


Task We introduce one case study for implementation. The task is related to the 
simple digit classifier that we implemented in Section 3.14. The task is to generate 
MNIST style handwritten digit images, as illustrated in Fig. 3.51. We intend to 
train generator so that it outputs an MNIST style fake image when fed by a random 
input signal. 


Model for generator As a generator model, we employ a 5-layer fully-connected 
neural network with four hidden layers, as depicted in Fig. 3.52. For activation at 
each hidden layer, we employ ReLU. Remember that an MNIST image consists of 
28-by-28 pixels, each indicating a gray-scaled value that spans from 0 to 1. Hence, 
for the output layer, we use 784 (= 28 x 28) neurons and logistic activation to 
ensure the range of [0, 1]. 

The employed network has five layers, so it is deeper than the two-layer network 
that we used earlier. In practice, for a somewhat deep neural network, each layer’s 
signals can exhibit quite different scalings. Such widely varying scaling has 
a detrimental effect upon training: it makes training unstable. Hence, people often apply an 
additional procedure (prior to ReLU), in order to control the scaling explicitly. 
The procedure is called Batch Normalization. 


Figure 3.51. Generator for MNIST-style handwritten digit images. 

Figure 3.52. Generator: A 5-layer fully-connected neural network where the input size 
(the dimension of a fake input signal) is 100; the numbers of hidden neurons are 128, 256, 
512, 1024; and the output size is 784 (= 28 x 28). We employ ReLU activation at every 
hidden layer, and logistic activation at the output layer to ensure 0-to-1 output signals. 
We use Batch Normalization prior to ReLU at each hidden layer. See Fig. 3.53 for details. 


Batch Normalization (Ioffe and Szegedy, 2015) Here is how it works. See 
Fig. 3.53. For illustrative purpose, focus on one particular hidden layer. Let 
$z := [z_1, \ldots, z_n]^T$ be the output of the considered hidden layer prior to activation. Here 
$n$ denotes the number of neurons in the hidden layer. 

Batch Normalization (BN for short) consists of two steps. First we do 
zero-centering and normalization using the mean and variance w.r.t. examples in an 
associated batch $B$: 


$$\mu_B = \frac{1}{m_B} \sum_{i \in B} z^{(i)}, \qquad \sigma_B^2 = \frac{1}{m_B} \sum_{i \in B} (z^{(i)} - \mu_B)^2 \tag{3.150}$$


where $(\cdot)^2$ indicates a component-wise square, hence $\sigma_B^2 \in \mathbb{R}^n$. In other words, we 
generate the normalized output, say $z_{\mathsf{norm}}^{(i)}$, as: 


$$z_{\mathsf{norm}}^{(i)} = \frac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{3.151}$$


where division and multiplication are all component-wise. Here $\epsilon$ is a tiny value 
introduced to avoid division by 0 (typically $10^{-5}$). 

Second, we do a customized scaling as per: 


$$\tilde{z}^{(i)} = \gamma z_{\mathsf{norm}}^{(i)} + \beta \tag{3.152}$$


where $\gamma, \beta \in \mathbb{R}^n$ indicate two new scaling parameters which are learnable via 
training. Again, the operations in (3.152) are all component-wise. 


Figure 3.53. Batch Normalization (BN): First we do zero-centering and normalization 
with the mean $\mu_B$ and the variance $\sigma_B^2$ computed over the examples in an associated 
batch $B$. Next we do a customized scaling by introducing two new parameters learnable 
during training: $\gamma \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^n$. 

BN lets the model learn the optimal scale and mean of the inputs for each hidden 
layer. This technique is quite instrumental in stabilizing and speeding up training 
especially for a very deep neural network. This has been verified experimentally by 
many practitioners. 
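The two BN steps can be sketched in a few lines of plain Python. This is a minimal illustration of the training-time computation only, under assumptions: the function name `batch_norm`, the toy batch, and the values chosen for the scale and shift parameters are all hypothetical, and this is not how TensorFlow implements the layer internally.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply batch normalization to a batch given as a list of equal-length vectors.

    Step 1: zero-center and normalize each component with the batch mean and variance.
    Step 2: rescale by gamma and shift by beta, component-wise.
    """
    m = len(batch)
    n = len(batch[0])
    mu = [sum(z[j] for z in batch) / m for j in range(n)]
    var = [sum((z[j] - mu[j]) ** 2 for z in batch) / m for j in range(n)]
    out = []
    for z in batch:
        z_norm = [(z[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n)]
        out.append([gamma[j] * z_norm[j] + beta[j] for j in range(n)])
    return out

# Toy batch of 4 examples and 2 neurons whose scales differ by a factor of 200.
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]]
gamma, beta = [1.0, 2.0], [0.0, 5.0]
out = batch_norm(batch, gamma, beta)

col_means = [sum(row[j] for row in out) / len(out) for j in range(2)]
col_stds = [math.sqrt(sum((row[j] - col_means[j]) ** 2 for row in out) / len(out))
            for j in range(2)]
print(col_means, col_stds)
```

Regardless of the original scale of each neuron, the output of component $j$ has batch mean close to the $j$th shift parameter and batch standard deviation close to the $j$th scale parameter, which is the sense in which BN lets the model choose the scale and mean itself.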


Model for discriminator As a discriminator model, we use a 3-layer fully- 
connected network with two hidden layers; see Fig. 3.54. The input size must be 
the same as that of the flattened real (or fake) image. Again we employ ReLU at 
hidden layers and logistic activation at the output layer. 
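To get a sense of the size of the two models, the sketch below counts the learnable parameters of the fully-connected layers, using the layer widths stated in the captions of Figs. 3.52 and 3.54. The helper name `dense_param_count` is hypothetical, and the count of BN parameters assumes one scale and one shift parameter per hidden neuron of the generator.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully-connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

generator_sizes = [100, 128, 256, 512, 1024, 784]   # input, four hidden layers, output
discriminator_sizes = [784, 512, 256, 1]            # input, two hidden layers, output

generator_dense = dense_param_count(generator_sizes)
discriminator_dense = dense_param_count(discriminator_sizes)

# Batch Normalization adds a learnable scale and shift per hidden neuron of the generator.
bn_learnable = 2 * sum(generator_sizes[1:-1])

print(generator_dense, bn_learnable, discriminator_dense)
```

Under these assumptions the generator has roughly three times as many parameters as the discriminator, most of them in its last two layers.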


Figure 3.54. Discriminator: A 3-layer fully-connected neural network where the input size 
(the dimension of a flattened vector of a real (or fake) image) is 784 (= 28 x 28); the 
numbers of hidden neurons are 512, 256; and the output size is 1. We employ ReLU activation 
at every hidden layer, and logistic activation at the output layer. 


TensorFlow: How to use BN? Loading MNIST data is the same as before, so 
we omit it. Instead we discuss how to use BN. TensorFlow provides a built-in class 
for BN: 

    BatchNormalization()


This is placed in tensorflow.keras.layers. Here is how to use the class in our setting: 


    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import BatchNormalization
    from tensorflow.keras.layers import ReLU

    generator = Sequential()
    generator.add(Dense(128, input_dim=latent_dim))
    generator.add(BatchNormalization())  # BN is applied prior to ReLU
    generator.add(ReLU())
    generator.add(Dense(256))
    generator.add(BatchNormalization())
    # ...


where latent_dim is the dimension of the fake input signal (which we set as 100). 


TensorFlow: Models for generator & discriminator Using the deep neural 
networks for generator and discriminator illustrated in Figs. 3.52 and 3.54, we can 
implement a code as below. 

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import BatchNormalization
    from tensorflow.keras.layers import ReLU

    latent_dim = 100

    # generator: 100 -> 128 -> 256 -> 512 -> 1024 -> 784 (Fig. 3.52)
    generator = Sequential()
    generator.add(Dense(128, input_dim=latent_dim))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(256))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(512))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(1024))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(28*28, activation='sigmoid'))

    # discriminator: 784 -> 512 -> 256 -> 1 (Fig. 3.54)
    discriminator = Sequential()
    discriminator.add(Dense(512, input_shape=(784,)))
    discriminator.add(ReLU())
    discriminator.add(Dense(256))
    discriminator.add(ReLU())
    discriminator.add(Dense(1, activation='sigmoid'))


TensorFlow: Optimizers for generator & discriminator We use Adam 
optimizers with lr=0.0002 and (b1,b2)=(0.5,0.999). Since we have two models 
(generator and discriminator), we employ two optimizers accordingly: 


    from tensorflow.keras.optimizers import Adam

    lr = 0.0002
    b1 = 0.5
    b2 = 0.999  # default choice
    optimizer_G = Adam(learning_rate=lr, beta_1=b1, beta_2=b2)
    optimizer_D = Adam(learning_rate=lr, beta_1=b1, beta_2=b2)


TensorFlow: Generator input As a generator input, we use a random signal 
with the Gaussian distribution. In particular, we use: 


$$x \in \mathbb{R}^{\mathsf{latent\_dim}}, \qquad x \sim \mathcal{N}(0, I_{\mathsf{latent\_dim}}).$$

Here is how to generate the Gaussian random signal in TensorFlow: 


    from tensorflow.random import normal
    x = normal([batch_size, latent_dim])


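TensorFlow: Binary cross entropy loss Consider the batch version of the 
GAN optimization (3.148): 


$$\min_{w} \max_{\theta} \frac{1}{m_B} \sum_{i \in B} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)}))). \tag{3.153}$$


We introduce the ground-truth real-vs-fake indicator vector $[1, 0]^T$ (real = 1, 
fake = 0). Then, the term $\log D_\theta(y^{(i)})$ can be viewed as the minus binary cross 
entropy between the real/fake indicator vector and its prediction counterpart 
$[D_\theta(y^{(i)}), 1 - D_\theta(y^{(i)})]^T$: 


$$\log D_\theta(y^{(i)}) = 1 \cdot \log D_\theta(y^{(i)}) + 0 \cdot \log(1 - D_\theta(y^{(i)})) = -\ell_{\mathsf{BCE}}(1, D_\theta(y^{(i)})). \tag{3.154}$$


On the other hand, the other term $\log(1 - D_\theta(\tilde{y}^{(i)}))$ can be interpreted as the minus 
binary cross entropy between the real/fake indicator vector $[0, 1]^T$ of a fake sample 
and its prediction counterpart $[D_\theta(\tilde{y}^{(i)}), 1 - D_\theta(\tilde{y}^{(i)})]^T$: 


$$\log(1 - D_\theta(\tilde{y}^{(i)})) = 0 \cdot \log D_\theta(\tilde{y}^{(i)}) + 1 \cdot \log(1 - D_\theta(\tilde{y}^{(i)})) = -\ell_{\mathsf{BCE}}(0, D_\theta(\tilde{y}^{(i)})). \tag{3.155}$$
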
We see that cross entropy plays a role in the computation of the objective func- 
tion. TensorFlow offers a built-in class for cross entropy: BinaryCrossentropy(). 
This is placed in tensorflow.keras.losses. Here is how to use it in our setting: 


from tensorflow.keras.losses import BinaryCrossentropy 
CE_loss = BinaryCrossentropy(from_logits=False) 
loss = CE_loss(real_fake_indicator, output) 


where output denotes discriminator output, and real_fake_indicator is the real/fake 
indicator vector (real = 1, fake = 0). Here output is the result after logistic 
activation; and real_fake_indicator is also a vector with the same dimension as output. 
The function BinaryCrossentropy() automatically detects the number of examples 
in an associated batch, thus yielding a normalized version (through division by $m_B$). 
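The identities (3.154) and (3.155) can be checked with a small hand-written version of the loss. The sketch below is plain Python rather than the TensorFlow class: the function names `bce` and `batch_bce` are hypothetical, the natural logarithm is used (a different base only rescales the loss by a constant), and no numerical clipping is applied.

```python
import math

def bce(label, prediction):
    """Binary cross entropy of one example (natural logarithm)."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1.0 - prediction))

def batch_bce(labels, predictions):
    """Binary cross entropy averaged over the examples in a batch."""
    return sum(bce(l, p) for l, p in zip(labels, predictions)) / len(labels)

# Hypothetical discriminator outputs on one real and one fake sample.
d_real, d_fake = 0.8, 0.3

minus_bce_real = -bce(1, d_real)   # should equal log D(y)
minus_bce_fake = -bce(0, d_fake)   # should equal log(1 - D(y_fake))
print(minus_bce_real, math.log(d_real))
print(minus_bce_fake, math.log(1.0 - d_fake))

# Averaging over a batch corresponds to the division by the batch size.
average = batch_bce([1, 0], [d_real, d_fake])
```

With label 1 only the first term of the cross entropy survives, and with label 0 only the second, which is exactly how the two terms of the GAN objective are recovered.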
TensorFlow: Generator loss Recall the proxy (3.149) for the generator loss 
that we will use: 


$$\min_{w} \frac{1}{m_B} \sum_{i \in B} -\log D_\theta(G_w(x^{(i)})) \overset{(a)}{=} \min_{w} \frac{1}{m_B} \sum_{i \in B} \ell_{\mathsf{BCE}}(1, D_\theta(G_w(x^{(i)}))) \tag{3.156}$$

where (a) follows from (3.154). We can use the function CE_loss implemented 
above to write the code as below: 

    g_loss = CE_loss(valid, discriminator(gen_imgs))


where gen_imgs indicates fake images (corresponding to the $G_w(x^{(i)})$’s) and valid 
denotes an all-1’s vector with the same dimension as the discriminator output. 


TensorFlow: Discriminator loss Recall the batch version of the optimization 
problem: 

$$\max_{\theta} \frac{1}{m_B} \sum_{i \in B} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)}))).$$

Taking the minus sign in the objective, we obtain the equivalent optimization: 


$$\min_{\theta} \underbrace{\frac{1}{m_B} \sum_{i \in B} -\log D_\theta(y^{(i)}) - \log(1 - D_\theta(G_w(x^{(i)})))}_{\text{discriminator loss}}$$


where the discriminator loss is defined as the minus version. Using (3.154) 
and (3.155), we can implement the discriminator loss as: 

    real_loss = CE_loss(valid, discriminator(real_imgs))
    fake_loss = CE_loss(fake, discriminator(gen_imgs))
    d_loss = real_loss + fake_loss


where real_imgs indicates real images (corresponding to the $y^{(i)}$’s) and fake denotes an 
all-0’s vector with the same dimension as the discriminator output. 
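To see the two losses side by side without running TensorFlow, here is a small sketch that mirrors the snippets above with a hand-written cross entropy. The function `batch_bce` and the discriminator outputs are hypothetical stand-ins for CE_loss and for the outputs of discriminator(real_imgs) and discriminator(gen_imgs); the natural logarithm is assumed.

```python
import math

def batch_bce(labels, predictions):
    """Binary cross entropy (natural logarithm) averaged over a batch."""
    total = 0.0
    for l, p in zip(labels, predictions):
        total += -(l * math.log(p) + (1 - l) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical discriminator outputs on a batch of 4 real and 4 generated images.
d_on_real = [0.9, 0.8, 0.7, 0.95]
d_on_fake = [0.2, 0.1, 0.4, 0.3]

valid = [1.0] * 4   # all-1's labels
fake = [0.0] * 4    # all-0's labels

# Generator loss (proxy): fake images should be classified as real.
g_loss = batch_bce(valid, d_on_fake)

# Discriminator loss: real images labeled 1, fake images labeled 0.
real_loss = batch_bce(valid, d_on_real)
fake_loss = batch_bce(fake, d_on_fake)
d_loss = real_loss + fake_loss

print(g_loss, d_loss)
```

In this made-up batch the discriminator is doing well, so its loss is small while the generator loss is large; the two players pull the outputs on fake images in opposite directions.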


TensorFlow: Training Using all of the above, one can implement a code for 
training. We leave details in Prob 11.6. 


Look ahead Over the past sections, we have investigated two machine learning 
applications: one pertains to supervised learning, while the other relates to unsu- 
pervised learning. In the upcoming section, we will delve into the final application, 
which pertains to a societal issue in machine learning and is linked to mutual infor- 
mation: fair machine learning. 
Problem Set 11 


Prob 11.1 (Generative Adversarial Networks) Consider a GAN with generator 
$G(\cdot)$ and discriminator $D(\cdot)$. Let $Y$ be a random variable that takes one of 
real samples $\{y^{(i)}\}_{i=1}^{m}$ with probability $\frac{1}{m}$, and $Q_Y$ be such probability distribution. 
Similarly we define $\tilde{Y}$ and $Q_{\tilde{Y}}$ for fake samples $\tilde{y}^{(i)} := G(x^{(i)})$ where $x^{(i)}$ indicates 
an input to $G(\cdot)$, $i \in \{1, \ldots, m\}$. 

Let $T \sim \mathsf{Bern}(\frac{1}{2})$ and 


$$\bar{Y} = \begin{cases} Y, & T = 1; \\ \tilde{Y}, & T = 0. \end{cases} \tag{3.157}$$

Let 

$$G^*_{\mathsf{MI}} := \arg\min_{G(\cdot)} I(T; \bar{Y}). \tag{3.158}$$


(a) Show that $G^*_{\mathsf{MI}}$ is the same as 

$$G^*_{\mathsf{JS}} := \arg\min_{G(\cdot)} \mathsf{JS}(Q_Y \| Q_{\tilde{Y}}).$$


(b) Show that $G^*_{\mathsf{MI}}$ is the same as 


$$G^*_{\mathsf{GAN}} := \arg\min_{G(\cdot)} \max_{D(\cdot)} \mathbb{E}_{Y}[\log D(Y)] + \mathbb{E}_{\tilde{Y}}[\log(1 - D(\tilde{Y}))].$$

(c) Suppose that the neural network class $\mathcal{N}$ can represent any arbitrary 
function. Show that the solution for $G(\cdot)$ to the following optimization 


$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\tilde{y}^{(i)})) \tag{3.159}$$


converges to $G^*_{\mathsf{GAN}}$ as $m \to \infty$. 

(d) In Section 3.16, we mentioned that (3.159) is the GAN optimization. 
Explain what the objective function in (3.159) means in the context of a 
two-player game in which one player (discriminator) wishes to discriminate 
real samples against fake ones while the other player (generator) wants to 
fool the discriminator. 


Prob 11.2 (A proxy for the generator loss in GAN) Consider the optimization 
problem for GAN in Section 3.16: 


$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\tilde{y}^{(i)})) \tag{3.160}$$

where $\mathcal{N}$ indicates a class of neural networks, and $y^{(i)}$ and $\tilde{y}^{(i)} := G(x^{(i)})$ denote 
real and fake samples respectively. Here $x^{(i)}$ denotes an input to the generator, and 
$m$ is the number of examples. Suppose that the inner optimization is solved to yield 
$D^*(\cdot)$. Then, the optimization problem becomes: 


$$\min_{G(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D^*(y^{(i)}) + \log(1 - D^*(\tilde{y}^{(i)})). \tag{3.161}$$


(a) Show that the optimization problem (3.161) is equivalent to: 


$$\min_{G(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D^*(\tilde{y}^{(i)})). \tag{3.162}$$


(b) Let $w$ be the weights of the generator. Show that 


$$\frac{d \log(1 - D^*(\tilde{y}^{(i)}))}{dw} = \frac{1}{\ln 2} \frac{1}{D^*(\tilde{y}^{(i)}) - 1} \frac{dD^*(\tilde{y}^{(i)})}{d\tilde{y}^{(i)}} \frac{d\tilde{y}^{(i)}}{dw}; \tag{3.163}$$

$$\frac{d(-\log D^*(\tilde{y}^{(i)}))}{dw} = \frac{1}{\ln 2} \frac{-1}{D^*(\tilde{y}^{(i)})} \frac{dD^*(\tilde{y}^{(i)})}{d\tilde{y}^{(i)}} \frac{d\tilde{y}^{(i)}}{dw}. \tag{3.164}$$


(c) Suppose that the discriminator works almost optimally, i.e., $D^*(\tilde{y}^{(i)})$ is very 
close to 0. Which is larger in magnitude between (3.163) and (3.164)? 
Instead of solving (3.162), people prefer to solve the following for $G(\cdot)$: 


$$\min_{G(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} -\log D^*(\tilde{y}^{(i)}). \tag{3.165}$$

Explain the rationale behind this alternative. 


Prob 11.3 (Batch normalization) Consider a deep neural network. Let 
$z^{(i)} := [z_1^{(i)}, \ldots, z_n^{(i)}]^T$ be the output of a hidden layer prior to activation for the $i$th 
example where $i \in \{1, 2, \ldots, m\}$ and $m$ is the number of examples. Here $n$ denotes the 
number of neurons in the hidden layer. 


(a) Let 


$$\mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z^{(i)} - \mu)^2 \tag{3.166}$$

where $(\cdot)^2$ indicates a component-wise square, hence $\sigma^2 \in \mathbb{R}^n$. Consider 


$$z_{\mathsf{norm}}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{3.167}$$

$$\tilde{z}^{(i)} = \gamma z_{\mathsf{norm}}^{(i)} + \beta \tag{3.168}$$


where $\gamma, \beta \in \mathbb{R}^n$. Again the division and multiplication are all 
component-wise. Here $\epsilon$ is a tiny value introduced to avoid division by 0 (typically 
$10^{-5}$). This is called a smoothing term. Assuming that $\epsilon$ is negligible and 
the $z^{(i)}$’s are independent over $i$, what are the mean and variance of $\tilde{z}^{(i)}$? 
Many researchers employ $\tilde{z}^{(i)}$ instead of $z^{(i)}$ during training. These 
operations include zero-centering and normalization (hence it is named batch 
normalization), followed by rescaling and shifting with two new parameters 
($\gamma$ and $\beta$) which are learnable via training. In other words, these operations 
let the model learn the optimal scale and mean of the inputs for each layer. 
This technique plays a role in stabilizing and speeding up training 
especially for a very deep neural network. This has been verified experimentally 
by many practitioners. 

(b) In practice, this operation is done over the current mini-batch, so the 
whole procedure is summarized as follows: for the current mini-batch $B$ 
with the size $m_B$, 


$$\mu_B = \frac{1}{m_B} \sum_{i \in B} z^{(i)}, \qquad \sigma_B^2 = \frac{1}{m_B} \sum_{i \in B} (z^{(i)} - \mu_B)^2,$$

$$z_{\mathsf{norm}}^{(i)} = \frac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad \tilde{z}^{(i)} = \gamma z_{\mathsf{norm}}^{(i)} + \beta. \tag{3.169}$$


At test time, there is no mini-batch to compute the empirical mean and 
standard deviation. Then, what can we do? Suggest a way to handle this 
issue and explain the rationale behind your suggestion. You may want to 
consult some well-known literature if you wish. 


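Prob 11.4 (Function optimization) Let $Y \sim P_Y$ and $\tilde{Y} \sim P_{\tilde{Y}}$. Consider: 


$$\mathbb{E}_{Y}\left[\log \frac{P_Y(Y)}{P_Y(Y) + P_{\tilde{Y}}(Y)}\right] + \mathbb{E}_{\tilde{Y}}\left[\log \frac{P_{\tilde{Y}}(\tilde{Y})}{P_Y(\tilde{Y}) + P_{\tilde{Y}}(\tilde{Y})}\right]. \tag{3.170}$$


(a) For $x < 1$, show that $\log(1 - x)$ is concave in $x$. 

(b) Show that (3.170) is the same as: 


$$\max_{D(\cdot)} \mathbb{E}_{Y}[\log D(Y)] + \mathbb{E}_{\tilde{Y}}[\log(1 - D(\tilde{Y}))]. \tag{3.171}$$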


Prob 11.5 (A lower bound of mutual information) Let $L$ and $X$ be random 
variables. Define $Y := G(X, L)$ for a function $G(\cdot, \cdot)$. Show that 


$$I(L; Y) \geq \mathbb{E}_{L, Y}[\log Q(L|Y)] + H(L) \tag{3.172}$$


for some conditional distribution $Q(\cdot|\cdot)$. 


Prob 11.6 (TensorFlow implementation of GAN) Consider Goodfellow’s 
GAN that we learned in Section 3.16. In this problem, you are asked to build a 
simple GAN that generates MNIST style handwritten digit images. We employ a 
5-layer neural network for generator with ReLU at all the hidden layers and logistic 
activation at the output layer. 


(a) (MNIST dataset loading) Using the following script (or otherwise), load the 
MNIST dataset: 

    from tensorflow.keras.datasets import mnist
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    X_train = X_train/255.
    X_test = X_test/255.


Explain the role of the following script: 


    import numpy as np

    def get_batches(data, batch_size):
        batches = []
        for i in range(int(data.shape[0]//batch_size)):
            batch = data[i*batch_size:(i+1)*batch_size]
            batches.append(batch)
        return np.asarray(batches)


(b) (Data visualization) Assume that the code in part (a) is executed. Using a 
skeleton code provided in Prob 10.8(b), write a script that plots 60 images 
in the first batch of X_train in one figure. Also plot the figure. 

(c) (Generator) Draw a block diagram for generator implemented by the 
following: 

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import BatchNormalization
    from tensorflow.keras.layers import ReLU

    latent_dim = 100

    generator = Sequential()
    generator.add(Dense(128, input_dim=latent_dim))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(256))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(512))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(1024))
    generator.add(BatchNormalization())
    generator.add(ReLU())
    generator.add(Dense(28*28, activation='sigmoid'))


(d) (Generator check) Upon the above codes being executed, report an output 
for the following: 


    from tensorflow.random import normal
    import matplotlib.pyplot as plt

    batch_size = 64
    x = normal([batch_size, latent_dim])
    gen_imgs = generator.predict(x)
    gen_imgs = gen_imgs.reshape(-1, 28, 28)

    num_of_images = 60
    for index in range(1, num_of_images+1):
        plt.subplot(6, 10, index)
        plt.axis('off')
        plt.imshow(gen_imgs[index], cmap='gray_r')


(e) (Discriminator) Draw a block diagram for discriminator implemented by 
the following: 


    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import ReLU

    discriminator = Sequential()
    discriminator.add(Dense(512, input_shape=(784,)))
    discriminator.add(ReLU())
    discriminator.add(Dense(256))
    discriminator.add(ReLU())
    discriminator.add(Dense(1, activation='sigmoid'))
(f) (Training) Suppose we construct the generator and discriminator as follows: 


    from tensorflow.keras.layers import Input
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam

    adam = Adam(learning_rate=0.0002, beta_1=0.5)

    # discriminator compile
    discriminator.compile(loss='binary_crossentropy',
                          optimizer=adam)
    # freeze disc's weights while training generator
    discriminator.trainable = False

    # define GAN with fake input and disc. output
    gan_input = Input(shape=(latent_dim,))
    x = generator(inputs=gan_input)
    output = discriminator(x)
    gan = Model(gan_input, output)
    gan.compile(loss='binary_crossentropy',
                optimizer=adam)


where generator() and discriminator() are the classes designed in parts (c) 
and (e), respectively. 

Explain how generator and discriminator are trained in the following 
code: 


    import numpy as np
    from tensorflow.random import normal

    EPOCHS = 50
    k = 2  # k:1 alternating gradient descent
    d_losses = []
    g_losses = []

    for epoch in range(1, EPOCHS + 1):
        # train per each batch
        np.random.shuffle(X_train)
        for i, real_imgs in enumerate(get_batches(X_train,
                                                  batch_size)):
            #####################
            # train discriminator
            #####################
            # fake input generation
            gen_input = normal([batch_size, latent_dim])
            # fake images
            gen_imgs = generator.predict(gen_input)
            real_imgs = real_imgs.reshape(-1, 28*28)
            # input for discriminator
            d_input = np.concatenate([real_imgs, gen_imgs])
            # label for discriminator
            # (first half: real (1); second half: fake (0))
            d_label = np.zeros(2*batch_size)
            d_label[:batch_size] = 1
            # train discriminator
            d_loss = discriminator.train_on_batch(d_input, d_label)

            #####################
            # train generator
            #####################
            if i % k:  # 1:k alternating gradient descent
                # fake input generation
                g_input = normal([batch_size, latent_dim])
                # label for fake image
                # Generator wants fake images to be treated
                # as real ones
                g_label = np.ones(batch_size)
                # train generator
                g_loss = gan.train_on_batch(g_input, g_label)

        d_losses.append(d_loss)
        g_losses.append(g_loss)


(g) (Training check) For epoch = 10, 30,50, 70,90: plot a figure that shows 
25 fake images from generator trained in part (f) or by other methods of 
yours. Also plot the generator loss and discriminator loss as a function of 
epochs. Include Python scripts as well. 


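Prob 11.7 (Minimax theorem) Let $f(x, y)$ be a continuous real-valued function 
defined on $\mathcal{X} \times \mathcal{Y}$ such that 


(i) $f(x, y)$ is convex in $x \in \mathcal{X}$ $\forall y \in \mathcal{Y}$; and 
(ii) $f(x, y)$ is concave in $y \in \mathcal{Y}$ $\forall x \in \mathcal{X}$ 


where $\mathcal{X}$ and $\mathcal{Y}$ are convex and compact sets. 
Note: You do not need to solve the optional problems below. 


(a) Show that 


$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) \geq \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y). \tag{3.173}$$


Does (3.173) hold also for any arbitrary function $f(\cdot, \cdot)$? 

(b) Suppose 


$$a < \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) \implies a < \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y). \tag{3.174}$$


Then, argue that (3.174) implies: 


$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) \leq \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y). \tag{3.175}$$


(c) Suppose that $a < \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y)$. Then, show that there are finite 
$y_1, \ldots, y_n \in \mathcal{Y}$ such that 


$$a < \min_{x \in \mathcal{X}} \max_{y \in \{y_1, \ldots, y_n\}} f(x, y). \tag{3.176}$$


(d) (Optional) Suppose that $a < \min_{x \in \mathcal{X}} \max_{y \in \{y_1, y_2\}} f(x, y)$ for any $y_1, y_2 \in 
\mathcal{Y}$. Then, show that there exists $y_0 \in \mathcal{Y}$ such that 


$$a < \min_{x \in \mathcal{X}} f(x, y_0). \tag{3.177}$$


(e) (Optional) Suppose that $a < \min_{x \in \mathcal{X}} \max_{y \in \{y_1, \ldots, y_n\}} f(x, y)$ for any finite 
$y_1, \ldots, y_n \in \mathcal{Y}$. Then, show that there exists $y_0 \in \mathcal{Y}$ such that 

$$a < \min_{x \in \mathcal{X}} f(x, y_0). \tag{3.178}$$


Hint: Use the proof-by-induction and part (d). 


Note: (3.178) implies that $a < \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y)$. This together with the 
results in parts (b) and (c) proves (3.175). Combining this with (3.173) proves the 
minimax theorem: 


$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y). \tag{3.179}$$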
Prob 11.8 (Training instability) Consider a function: 


$$f(x, y) = (2 + \cos x)(2 + \cos y) \tag{3.180}$$

where $x, y \in \mathbb{R}$.


(a) Solve the following optimization (i.e., find the optimal solution as well as 
the points that achieve it): 


$$\min_{x} \max_{y} f(x, y). \tag{3.181}$$

(b) Solve the reverse version of the optimization:

$$\max_{y} \min_{x} f(x, y). \tag{3.182}$$


(c) Suppose that we perform 1:1 alternating gradient descent for $f(x, y)$ with
an initial point $(x^{(0)}, y^{(0)}) = (\pi + 0.1, -0.1)$. Plot $f(x^{(t)}, y^{(t)})$ as a function
of $t$, where $(x^{(t)}, y^{(t)})$ denotes the estimate at the $t$th iteration. What are the
limiting values of $(x^{(t)}, y^{(t)})$? Also explain why.

Note: You may want to set the learning rates properly so that the convergence
behaviour is clear.

(d) Redo part (c) with a different initial point $(x^{(0)}, y^{(0)}) = (0.1, \pi - 0.1)$.


Prob 11.9 (Alternating gradient descent) Consider a function: 
CS ase (3.183) 
where $x, y \in \mathbb{R}$.

(a) Solve the following optimization:

$$\min_{x} \max_{y} f(x, y). \tag{3.184}$$

(b) Suppose that we perform 1:1 alternating gradient descent for $f(x, y)$ with
an initial point $(x^{(0)}, y^{(0)}) = (1, 1)$. Plot $f(x^{(t)}, y^{(t)})$ as a function of $t$, where
$(x^{(t)}, y^{(t)})$ denotes the estimate at the $t$th iteration. What are the limiting
values of $(x^{(t)}, y^{(t)})$? Also explain why.

Note: You may want to set the learning rates properly so that the convergence
behaviour is clear.

(c) Redo part (b) with a different initial point $(x^{(0)}, y^{(0)}) = (-1, -1)$.


Prob 11.10 (True or False?)
(a) Consider the following optimization:

$$\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}} \; x^2 - y.$$

With 1:1 alternating gradient descent with a proper choice of the learning
rates, one can achieve the optimal solution to the above.

(b) Consider the following optimization:

$$\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}} \; (2 + \cos x)(2 + \cos y).$$

Suppose we perform 1:1 alternating gradient descent with a proper choice
of the learning rates. Then, the converging points can be distinct depending
on different initial points.

3.18 Fair Machine Learning and Mutual Information (1/2)


Recap Throughout the preceding sections, we have investigated two prominent 
methodologies for machine learning: (i) supervised learning; and (ii) unsupervised 
learning. We found that cross entropy plays a pivotal role in designing the opti- 
mal loss function for supervised learning. Additionally, we established a fascinat- 
ing relationship between GANs (an unsupervised learning framework) and two 
information-theoretic notions: the KL divergence and mutual information. 


Next application As the final application, we will delve into a recent topic in
machine learning: fair machine learning. There are three reasons for choosing this
topic. Firstly, as machine learning becomes increasingly prevalent in applications
such as medicine, finance, job hiring and criminal justice, it is essential to
ensure fairness for all groups involved. This morally and legally motivated need has
gained significant attention in the design of machine learning algorithms, particularly
after a fairness issue was highlighted in a learning algorithm used in US courts,
which yielded unbalanced recidivism scores across different races (Larson et al., 2016).
Thus, this important societal topic will be discussed in this book. Secondly, we will
explore the connection between information theory and fair machine learning,
specifically the role of mutual information in formulating an optimization problem
for these algorithms. Finally, we will examine how the associated optimization
problem is closely related to the GAN optimization we learned in the past sections,
creating a coherent sequence of applications from supervised learning and GANs to
fair machine learning.


During upcoming lectures Over the next couple of sections, we will thoroughly
explore fair machine learning. We will cover four parts. Firstly, we will
define fair machine learning and its purpose. Secondly, we will examine
two widely used fairness concepts found in the current literature. Thirdly, we will
formulate an optimization framework for fair machine learning algorithms that
abide by fairness constraints inspired by these concepts, and establish a
connection between mutual information and the optimization, drawing parallels to
GANs. Lastly, we will learn how to solve the optimization problem and implement
it in TensorFlow. In this section, we will cover the first two.

Figure 3.55. (a) Supervised learning: learning the function $f(\cdot)$ of a system of
interest from input-output example pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$; (b) Generative modeling
(an unsupervised learning methodology): generating fake data that resemble real data,
reflected in $\{x^{(i)}\}_{i=1}^{m}$.

Figure 3.56. Machine learning-based recidivism score predictor used in US courts:
Black defendants were 77.3 percent more likely than white defendants to receive
high recidivism scores.


Fair machine learning Fair machine learning refers to a specific area of machine 
learning that is concerned with fairness. It can be defined as a field of algorithms 
that train a machine to perform a given task in a fair manner. Fair machine learn- 
ing can be divided into two main methodologies, just like traditional machine 
learning. The first is fair supervised learning, where the goal is to develop a fair 
classifier or predictor using a set of input-output sample pairs. The second is 
the unsupervised learning counterpart, which includes fair generative modeling. 
This aims to produce synthetic data that is both realistic and fair in terms of the 
statistics of the generated samples. In this book, we will focus on fair supervised 
learning. 


Two major concepts on fairness To develop a fair classifier, it is necessary to 
grasp the meaning of fairness. The term “fairness” is rooted in law and has a lengthy 
and substantial history, with many concepts in the legal field. For our purposes, 
we will concentrate on two well-known concepts that have garnered significant 
attention in recent literature. 

The first concept we will discuss is known as disparate treatment (DT). This
concept is centered around unequal treatment that results from sensitive attributes
such as race, sex, or religion. It is sometimes referred to as direct discrimination, as
these attributes directly lead to discrimination.

Figure 3.57. A criminal reoffending predictor.

The second concept we will focus on is disparate impact (DI). This term is used to 
describe a situation where one group is adversely affected compared to another, even 
when neutral rules are in place. Neutral rules are those where sensitive attributes are 
not considered in classification, thereby preventing any instances of DT. Disparate 
impact is also known as indirect discrimination because the biased historical data 
leads to a disparate outcome indirectly. 


Criminal reoffending predictor How can we design a fair classifier that meets
both the disparate treatment and disparate impact fairness criteria? To make it
easier, we will examine this in the context of a simple prediction scenario:
forecasting criminal reoffending. The goal is to forecast whether a person who has a
criminal record is likely to reoffend within two years; such predictors have been
used in US courts in deciding parole.


A simple setting For illustrative purposes, we will examine a simplified version
of the predictor, visualized in Fig. 3.57. The predictor uses two types of data:
(i) objective data; and (ii) sensitive data (sensitive attributes). For objective data,
denoted by $x$, we consider only two features, $x_1$ and $x_2$. The variable $x_1$ represents
the number of prior criminal records, while $x_2$ represents the criminal type, such
as misdemeanour or felony. For sensitive data, we use a different notation $z$. We
consider a simple case in which $z$ is binary, indicating only the race type of the
individual, either white ($z = 0$) or black ($z = 1$). Let $\hat{y}$ be the classifier output,
which aims to represent the ground-truth conditional distribution $P(y|x, z)$. Here
$y$ denotes the ground-truth label: $y = 1$ means reoffending within 2 years; $y = 0$
otherwise. This is a supervised learning setup, so we are given $m$ example triplets:
$\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m}$.


How to avoid disparate treatment? First of all, how do we deal with disparate
treatment? Recall the DT concept: unequal treatment directly because of
sensitive attributes. Hence, in order to avoid DT, we should ensure that the
prediction is not a function of the sensitive attributes. Mathematically, this means:

$$P(\hat{y}|x, z) = P(\hat{y}|x) \quad \forall z. \tag{3.185}$$

How to ensure the above? The solution is very simple: do not use the sensitive
attribute $z$ at all in prediction, as illustrated with a red-colored “x” mark in Fig. 3.57.
Sensitive attributes are offered as part of the training data although they are not used
as an input. In other words, we employ the $z^{(i)}$'s only in the design of an algorithm.


What about disparate impact? What about the other fairness criterion,
disparate impact? How do we avoid DI? Again recall the DI concept: an
action that adversely affects one group against another even with formally neutral
rules. It is not that clear how to implement this.

To gain some insights, let us investigate the mathematical definition of DI. To
this end, we introduce some notation. Let $Z$ be a random variable for a sensitive
attribute. For instance, consider a binary case, say $Z \in \{0, 1\}$. Let $\tilde{Y}$ be a binary
hard-decision value of the predictor output $\hat{Y}$ at the middle threshold:
$\tilde{Y} := \mathbf{1}\{\hat{Y} > 0.5\}$. Observe the ratio of the likelihoods of the positive event
$\tilde{Y} = 1$ for the two cases $Z = 0$ and $Z = 1$:

$$\frac{P(\tilde{Y} = 1|Z = 0)}{P(\tilde{Y} = 1|Z = 1)}. \tag{3.186}$$

One natural interpretation is that a classifier is more fair when the ratio is closer to
1, and becomes unfair if the ratio is far away from 1. One quantification of the degree
of fairness regarding DI was proposed by Zafar et al. (2017):

$$\mathrm{DI} := \min\left(\frac{P(\tilde{Y} = 1|Z = 0)}{P(\tilde{Y} = 1|Z = 1)}, \frac{P(\tilde{Y} = 1|Z = 1)}{P(\tilde{Y} = 1|Z = 0)}\right). \tag{3.187}$$

Notice that $0 \leq \mathrm{DI} \leq 1$, and the larger the DI, the more fair the situation is.
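
As a quick numerical illustration of (3.187), take two probabilities chosen here purely for illustration (they are not drawn from any dataset): suppose $P(\tilde{Y} = 1|Z = 0) = 0.5$ and $P(\tilde{Y} = 1|Z = 1) = 0.75$, where $\tilde{Y}$ denotes the hard decision. Then

```latex
\mathrm{DI} = \min\left(\frac{0.5}{0.75}, \frac{0.75}{0.5}\right)
            = \min\left(\tfrac{2}{3}, \tfrac{3}{2}\right)
            = \tfrac{2}{3} \approx 0.67 .
```

If the two probabilities were equal instead, both ratios would be 1 and DI would attain its maximum value of 1.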


Two cases In view of the mathematical definition (3.187), reducing disparate
impact means maximizing the quantity (3.187). How do we design a classifier so as to
maximize DI? Depending on the situation, the design methodology can differ.
To see this, think about two extreme cases.

The first is a case in which the training data already respects fairness:

$$\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m} \;\rightarrow\; \text{large DI}.$$

In this case, a natural solution is to rely on a conventional classifier that aims to
maximize prediction accuracy. Why? Because maximizing prediction accuracy would
well respect the training data, which in turn yields a large DI. The second is a
non-trivial case in which the training data is far from being fair:

$$\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m} \;\rightarrow\; \text{small DI}.$$

In this case, the conventional classifier would yield a small DI. This is indeed a
challenging scenario where we need to take some non-trivial action to ensure
fairness.

Figure 3.58. Visualization of a historically biased dataset: a hollow (or black-colored
solid) circle indicates a data point of an individual of white (or black) race; the red (or
blue) colored edge denotes the $y = 1$ reoffending (or $y = 0$ non-reoffending) label.

In reality, the second scenario is often observed due to the existence of biased
historical records that form the basis of the training data. For example, the decisions
made by courts may be biased against certain races, and these decisions
are likely to be included as part of the training data. Fig. 3.58 illustrates one such
biased scenario, where a hollow or black-colored solid circle represents a data point
for an individual of white or black race, respectively, and the red or blue colored
edge denotes the event of the individual reoffending or not reoffending within two
years. This is a biased situation, as there are more black-colored solid circles than
hollow ones for positive examples where $y = 1$, indicating a bias in historical records
favoring whites over blacks. Similarly, for negative examples where $y = 0$, there are
more hollow circles than solid ones.


How to ensure a large DI? How can we guarantee a high level of DI in all
scenarios, including the challenging one described above? To gain a better
understanding, let us revisit an optimization problem we previously formulated in the
development of a traditional classifier:

$$\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) \tag{3.188}$$

where $\ell_{\mathrm{CE}}(\cdot, \cdot)$ indicates the binary cross entropy loss, and $w$ denotes the weights
(parameters) of a classifier. One natural approach to encourage a large DI is to incorporate
a DI-related constraint. Maximizing DI is equivalent to minimizing $1 - \mathrm{DI}$ (since
$0 \leq \mathrm{DI} \leq 1$). We can resort to a well-known technique in optimization:
regularization, that is, adding the two objectives with different weights.


Regularized optimization Here is a regularized optimization:

$$\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot (1 - \mathrm{DI}) \tag{3.189}$$

where $\lambda$ denotes a regularization factor that balances prediction accuracy against
the DI-associated objective (minimizing $1 - \mathrm{DI}$). However, an issue arises in solving
the regularized optimization (3.189). Recalling the definition of DI,

$$\mathrm{DI} := \min\left(\frac{P(\tilde{Y} = 1|Z = 0)}{P(\tilde{Y} = 1|Z = 1)}, \frac{P(\tilde{Y} = 1|Z = 1)}{P(\tilde{Y} = 1|Z = 0)}\right),$$

we see that DI is a complicated function of $w$. We have no idea how to express
DI in terms of $w$.


Another way Since it is not feasible to express DI as a function of $w$, we
consider an alternative approach inspired by information theory, specifically mutual
information. If $\mathrm{DI} = 1$, then the sensitive attribute $Z$ and the hard decision $\tilde{Y}$ are
independent. Mutual information has the significant property that the mutual
information between two random variables is zero if and only if the two variables are
independent; that is, it is a “sufficient and necessary condition.” This motivates us to
represent the constraint $\mathrm{DI} = 1$ as follows:

$$I(Z; \tilde{Y}) = 0. \tag{3.190}$$

This captures the independence between $Z$ and $\tilde{Y}$. Since the predictor output is $\hat{Y}$
(instead of $\tilde{Y}$), we consider another, stronger condition that concerns $\hat{Y}$ directly:

$$I(Z; \hat{Y}) = 0. \tag{3.191}$$

The condition (3.191) is indeed stronger than (3.190), i.e., (3.191) implies (3.190).
This is because

$$I(Z; \tilde{Y}) \overset{(a)}{\leq} I(Z; \tilde{Y}, \hat{Y}) \overset{(b)}{=} I(Z; \hat{Y}) \tag{3.192}$$

where (a) is due to the chain rule and non-negativity of mutual information; and
(b) is because $\tilde{Y}$ is a function of $\hat{Y}$: $\tilde{Y} := \mathbf{1}\{\hat{Y} > 0.5\}$. Notice that (3.191) together
with (3.192) gives (3.190).
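
The inequality between the hard-decision and soft-output mutual information can be checked numerically. The sketch below is only an illustration under assumptions made here: the joint pmf of the attribute and a three-level score is invented, and the names `mutual_information`, `joint_soft` and `joint_hard` are hypothetical helpers, not taken from the book.

```python
import math


def mutual_information(joint):
    """Return I(A;B) in bits for a joint pmf given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


# Hypothetical joint pmf of (Z, Y_hat): Y_hat takes three score levels.
joint_soft = {
    (0, 0.2): 0.20, (0, 0.4): 0.20, (0, 0.8): 0.10,
    (1, 0.2): 0.10, (1, 0.4): 0.20, (1, 0.8): 0.20,
}

# Threshold the score at 0.5 to obtain the joint pmf of (Z, Y_tilde).
joint_hard = {}
for (z, score), p in joint_soft.items():
    key = (z, int(score > 0.5))
    joint_hard[key] = joint_hard.get(key, 0.0) + p

i_soft = mutual_information(joint_soft)   # I(Z; Y_hat)
i_hard = mutual_information(joint_hard)   # I(Z; Y_tilde)
print(f"I(Z; Y_hat) = {i_soft:.4f} bits, I(Z; Y_tilde) = {i_hard:.4f} bits")
```

Because the hard decision is a function of the score, the second value cannot exceed the first, so driving the soft-output mutual information to zero also drives the hard-decision one to zero.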


Strongly regularized optimization In summary, the condition (3.191) indeed
enforces the $\mathrm{DI} = 1$ constraint. This motivates us to consider the following
optimization:

$$\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}). \tag{3.193}$$

How to express $I(Z; \hat{Y})$ in terms of the classifier parameters $w$? Interestingly, there is a
way to express it. The idea is intimately related to the GAN optimization that we
learned.


Look ahead In the next section, we will review the GAN briefly and use it to
formulate an optimization for a fair classifier.

3.19 Fair Machine Learning and Mutual Information (2/2)


Recap In the preceding section, we presented the last application of information
theory: a fair classifier. We used a recidivism predictor as an instance of a fair
classifier, which aims to forecast whether an individual with previous criminal records
would reoffend within two years, as shown in Fig. 3.59. To prevent disparate
treatment, a prominent fairness concept, we excluded the sensitive attribute from the
input. To incorporate another fairness notion, disparate impact (DI for brevity),
we added a regularization term to the conventional optimization that only considered
prediction accuracy, resulting in the following optimization:

$$\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}) \tag{3.194}$$

where $\lambda \geq 0$ is a regularization factor that balances prediction accuracy (reflected in
the binary cross entropy terms) against the fairness constraint, reflected in $I(Z; \hat{Y})$.
Remember that $I(Z; \hat{Y}) = 0$ is a sufficient condition for $\mathrm{DI} = 1$. At the end,
we claimed that one can express $I(Z; \hat{Y})$ in terms of the optimization parameter $w$,
thereby enabling us to train the model parameterized by $w$. The idea for the translation
is to use the GAN trick that we learned in the past sections.


Outline In this section, we will support our claim. We will cover three parts in
detail. Firstly, we will explain what we mean by the “GAN trick”. Secondly, we will
utilize the “GAN trick” to construct an optimization problem in a simple scenario
where the sensitive attribute is binary. Finally, we will generalize this approach to
situations where the sensitive attribute can take on any value from an arbitrary
alphabet.

Figure 3.59. A simple recidivism predictor: predicting a recidivism score $\hat{y}$ from
$x = (x_1, x_2)$. Here $x_1$ indicates the number of prior criminal records; $x_2$ denotes a
criminal type (misdemeanor or felony); and $z$ is a race type among white ($z = 0$) and
black ($z = 1$).

The GAN trick Recall the inner optimization in GANs:

$$\max_{D(\cdot)} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(y^{(i)}) + \log\left(1 - D(\tilde{y}^{(i)})\right) \right]$$

where $y^{(i)}$ and $\tilde{y}^{(i)}$ indicate the $i$th real and fake samples, respectively; and $D(\cdot)$ is a
discriminator output. Also recall the connection with mutual information:

$$I(T; \bar{Y}) \approx \max_{D(\cdot)} \frac{1}{2m} \sum_{i=1}^{m} \left[ \log D(y^{(i)}) + \log\left(1 - D(\tilde{y}^{(i)})\right) \right] + H(T) \tag{3.195}$$

where $T = \mathbf{1}\{\text{discriminator's input is real}\}$ and $\bar{Y}$ is a random variable that takes a
real sample if $T = 1$; a fake sample if $T = 0$.

Here what we mean by the GAN trick is the other way around, taking the reverse
order. We start with mutual information and then express it in terms of an
optimization problem similarly to (3.195). Now let us apply this trick to our problem
setting. For illustrative purposes, we start with a simple binary sensitive attribute
setting.


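Mutual information vs KL divergence In our optimization (3.194), $Z$ is a
sensitive attribute indicator, which plays the same role as $T$ in (3.195). Similarly, $\hat{Y}$
in (3.194) serves the same role as $\bar{Y}$ in (3.195). Hence, one can expect that $I(Z; \hat{Y})$
in (3.194) would be expressed similarly as in (3.195). A slight distinction lies in the
detailed expression. To see the distinction, we start by manipulating $I(Z; \hat{Y})$ from
scratch.

Starting with the relationship between mutual information and KL divergence,
we get:

$$\begin{aligned}
I(Z; \hat{Y}) &= \mathrm{KL}\left(P_{\hat{Y},Z} \,\|\, P_{\hat{Y}} P_{Z}\right) \\
&\overset{(a)}{=} \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y}) P_{Z}(z)} \\
&= \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{1}{P_{Z}(z)} \\
&\overset{(b)}{=} \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + \sum_{z \in \mathcal{Z}} P_{Z}(z) \log \frac{1}{P_{Z}(z)} \\
&\overset{(c)}{=} \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + H(Z)
\end{aligned}$$

where (a) is due to the definition of the KL divergence; (b) comes from the total
probability law; and (c) is due to the definition of entropy.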


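Observation For the binary sensitive attribute case, we have:

$$\begin{aligned}
I(Z; \hat{Y}) &= \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + H(Z) \\
&= \sum_{\hat{y} \in \hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 1) \log \underbrace{\frac{P_{\hat{Y},Z}(\hat{y}, 1)}{P_{\hat{Y}}(\hat{y})}}_{=: D^{*}(\hat{y})} + \sum_{\hat{y} \in \hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 0) \log \underbrace{\frac{P_{\hat{Y},Z}(\hat{y}, 0)}{P_{\hat{Y}}(\hat{y})}}_{= 1 - D^{*}(\hat{y})} + H(Z).
\end{aligned} \tag{3.196}$$

Notice that the first log-inside term, defined as $D^{*}(\hat{y})$, has a close relationship
with the second log-inside term in the above: the sum of the two is 1. This reminds
us of the objective function in the optimization (3.195). So one may conjecture
that $I(Z; \hat{Y})$ can be expressed in terms of an optimization problem as follows: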


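Theorem 3.1. The mutual information $I(Z; \hat{Y})$ can be represented as the following
function optimization:

$$I(Z; \hat{Y}) = \max_{D(\cdot)} \left[ \sum_{\hat{y} \in \hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 1) \log D(\hat{y}) + \sum_{\hat{y} \in \hat{\mathcal{Y}}} P_{\hat{Y},Z}(\hat{y}, 0) \log\left(1 - D(\hat{y})\right) \right] + H(Z). \tag{3.197}$$

Proof: The objective in (3.197) is concave in $D(\cdot)$, since the log function is concave
and concavity is preserved under addition; so the maximization is a convex
optimization problem. Hence, looking into the unique stationary point, we can prove
the equivalence. Taking the derivative w.r.t. $D(\hat{y})$, we get:

$$\frac{1}{\ln 2} \left( \frac{P_{\hat{Y},Z}(\hat{y}, 1)}{D_{\mathrm{opt}}(\hat{y})} - \frac{P_{\hat{Y},Z}(\hat{y}, 0)}{1 - D_{\mathrm{opt}}(\hat{y})} \right) = 0 \quad \forall \hat{y}$$

where $D_{\mathrm{opt}}(\hat{y})$ is the optimal solution to the optimization in (3.197). This gives:

$$D_{\mathrm{opt}}(\hat{y}) = \frac{P_{\hat{Y},Z}(\hat{y}, 1)}{P_{\hat{Y},Z}(\hat{y}, 1) + P_{\hat{Y},Z}(\hat{y}, 0)} = \frac{P_{\hat{Y},Z}(\hat{y}, 1)}{P_{\hat{Y}}(\hat{y})}$$

where the second equality is due to the total probability law. Since $D_{\mathrm{opt}}(\hat{y})$ is the
same as $D^{*}(\hat{y})$ that we defined in (3.196), we complete the proof.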
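
The statement of Theorem 3.1 can be sanity-checked on a small discrete example. The following is a minimal sketch under assumptions made here: the joint pmf is invented for illustration, the score alphabet has only three levels, and all names (`joint`, `objective`, `d_star`, and so on) are hypothetical.

```python
import math
import random

# Hypothetical joint pmf P(y_hat, z): three score levels, binary attribute.
joint = {
    (0.2, 0): 0.20, (0.4, 0): 0.20, (0.8, 0): 0.10,
    (0.2, 1): 0.10, (0.4, 1): 0.20, (0.8, 1): 0.20,
}
scores = sorted({s for s, _ in joint})
p_score = {s: joint[(s, 0)] + joint[(s, 1)] for s in scores}
p_z = {z: sum(joint[(s, z)] for s in scores) for z in (0, 1)}


def objective(d):
    """Bracketed term of (3.197) for a discriminator d: score -> value in (0, 1)."""
    return sum(joint[(s, 1)] * math.log2(d[s])
               + joint[(s, 0)] * math.log2(1 - d[s]) for s in scores)


entropy_z = -sum(p * math.log2(p) for p in p_z.values())
mutual_info = sum(p * math.log2(p / (p_score[s] * p_z[z]))
                  for (s, z), p in joint.items())

# Candidate maximizer from the proof: D*(y_hat) = P(y_hat, 1) / P(y_hat).
d_star = {s: joint[(s, 1)] / p_score[s] for s in scores}
best = objective(d_star)

# Randomly drawn discriminators should never beat D*.
random.seed(0)
trials = [objective({s: random.uniform(0.01, 0.99) for s in scores})
          for _ in range(1000)]

print("gap between I(Z; Y_hat) and objective(D*) + H(Z):",
      abs(best + entropy_z - mutual_info))
print("best random objective:", max(trials), "objective at D*:", best)
```

A random search is of course not a proof; it only illustrates that the stationary point found in the proof behaves as a maximizer on this toy pmf.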


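How to express $I(Z; \hat{Y})$ in terms of $w$? The formula (3.197) contains two
probability quantities, $P_{\hat{Y},Z}(\hat{y}, 1)$ and $P_{\hat{Y},Z}(\hat{y}, 0)$, which are not available. What we
are given is $\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m}$. So we need to figure out what we can do with
this information to compute the probability quantities. To this end, we rely upon
the empirical distribution:

$$Q_{\hat{Y},Z}(\hat{y}^{(i)}, 1) = \frac{1}{m} \;\; \text{for } i: z^{(i)} = 1; \qquad Q_{\hat{Y},Z}(\hat{y}^{(i)}, 0) = \frac{1}{m} \;\; \text{for } i: z^{(i)} = 0.$$

In practice, the empirical distribution is likely to be uniform, since $\hat{y}$ is
real-valued and hence the pair $(\hat{y}, z)$ is unique with high probability. By applying
these empirical distributions, we can approximate $I(Z; \hat{Y})$ as:

$$I(Z; \hat{Y}) \approx \max_{D(\cdot)} \frac{1}{m} \left[ \sum_{i: z^{(i)} = 1} \log D(\hat{y}^{(i)}) + \sum_{i: z^{(i)} = 0} \log\left(1 - D(\hat{y}^{(i)})\right) \right] + H(Z). \tag{3.198}$$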

Implementable optimization (Cho et al., 2020) Recall the original
optimization:

$$\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}).$$

Applying the approximation (3.198) to the above, we get:

$$\min_{w} \max_{\theta} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \left[ \sum_{i: z^{(i)} = 1} \log D_{\theta}(\hat{y}^{(i)}) + \sum_{i: z^{(i)} = 0} \log\left(1 - D_{\theta}(\hat{y}^{(i)})\right) \right] \right\} \tag{3.199}$$

where $D(\cdot)$ is parameterized with $\theta$. The sensitive-attribute entropy $H(Z)$
in (3.198) is removed, since it does not depend on the optimization parameters $(\theta, w)$.
The objective function has an explicit relationship with the optimization
parameters $(\theta, w)$. Hence, the parameters are trainable via a practical algorithm. In the
next section, we will discuss details of the algorithm.


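Extension to a non-binary sensitive attribute We previously focused on the
binary sensitive attribute setting, but in practice, this may not always be the case. For
example, there may be multiple race types, such as black, white, Asian, Hispanic,
and multiple sensitive attributes, such as gender and religion. To account for these
practical scenarios, we now consider a sensitive attribute with an arbitrary alphabet
size. Multiple sensitive attributes can be represented as a single random variable
with an arbitrary alphabet size. Therefore, we consider a setting where $Z$ belongs
to the set $\mathcal{Z}$ and the cardinality of $\mathcal{Z}$ is not limited to two.

By recalling the relationship between mutual information and the KL
divergence, we can obtain:

$$\begin{aligned}
I(Z; \hat{Y}) &= \mathrm{KL}\left(P_{\hat{Y},Z} \,\|\, P_{\hat{Y}} P_{Z}\right) \\
&= \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y}) P_{Z}(z)} \\
&= \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log \underbrace{\frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})}}_{=: D^{*}(\hat{y}, z)} + H(Z).
\end{aligned} \tag{3.200}$$

Defining the log-inside term in the above as $D^{*}(\hat{y}, z)$, we obtain:

$$\sum_{z \in \mathcal{Z}} D^{*}(\hat{y}, z) = 1 \quad \forall \hat{y} \in \hat{\mathcal{Y}}.$$

This is due to the total probability law. Similar to Theorem 3.1, we can come up
with the following equivalence.

Theorem 3.2. The mutual information $I(Z; \hat{Y})$ can be represented as the following
function optimization:

$$I(Z; \hat{Y}) = \max_{D(\hat{y}, z):\, \sum_{z \in \mathcal{Z}} D(\hat{y}, z) = 1} \; \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log D(\hat{y}, z) + H(Z). \tag{3.201}$$

Proof: It is a convex optimization problem, but we have multiple equality
constraints. So we use the Lagrange multiplier method, which relies upon the
KKT conditions (Karush, 1939; Kuhn and Tucker, 2014; Boyd and Vandenberghe,
2004). First define the Lagrange function:

$$\mathcal{L}(D(\hat{y}, z), \nu(\hat{y})) = \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z}(\hat{y}, z) \log D(\hat{y}, z) + \sum_{\hat{y} \in \hat{\mathcal{Y}}} \nu(\hat{y}) \left( 1 - \sum_{z \in \mathcal{Z}} D(\hat{y}, z) \right)$$

where the $\nu(\hat{y})$'s are Lagrange multipliers. There are $|\hat{\mathcal{Y}}|$ Lagrange multipliers.
We solve the problem via the KKT conditions:

$$\frac{d\mathcal{L}(D(\hat{y}, z), \nu(\hat{y}))}{dD(\hat{y}, z)} = \frac{1}{\ln 2} \frac{P_{\hat{Y},Z}(\hat{y}, z)}{D_{\mathrm{opt}}(\hat{y}, z)} - \nu_{\mathrm{opt}}(\hat{y}) = 0 \quad \forall \hat{y}, z;$$

$$\frac{d\mathcal{L}(D(\hat{y}, z), \nu(\hat{y}))}{d\nu(\hat{y})} = 1 - \sum_{z \in \mathcal{Z}} D_{\mathrm{opt}}(\hat{y}, z) = 0 \quad \forall \hat{y}.$$

Plugging in the following:

$$D_{\mathrm{opt}}(\hat{y}, z) = \frac{P_{\hat{Y},Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})}, \qquad \nu_{\mathrm{opt}}(\hat{y}) = \frac{1}{\ln 2} P_{\hat{Y}}(\hat{y}),$$

we satisfy the KKT conditions. This implies that $D_{\mathrm{opt}}(\hat{y}, z)$ is indeed the optimal
solution. Since $D_{\mathrm{opt}}(\hat{y}, z)$ is the same as $D^{*}(\hat{y}, z)$ that we defined in (3.200), we
complete the proof.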


Implementable optimization: General case (Cho et al., 2020) Again, for
computation of $I(Z; \hat{Y})$, we rely on the empirical version of the true distribution
$P_{\hat{Y},Z}(\hat{y}, z)$:

$$Q_{\hat{Y},Z}(\hat{y}^{(i)}, z^{(i)}) = \frac{1}{m} \quad \forall i \in \{1, \dots, m\}. \tag{3.202}$$

So we get:

$$I(Z; \hat{Y}) \approx \max_{D(\hat{y}, z):\, \sum_{z \in \mathcal{Z}} D(\hat{y}, z) = 1} \frac{1}{m} \sum_{i=1}^{m} \log D(\hat{y}^{(i)}, z^{(i)}) + H(Z). \tag{3.203}$$

By parameterizing $D(\cdot, \cdot)$ with $\theta$ and excluding $H(Z)$ (which does not depend on
$(\theta, w)$), we obtain the following optimization:

$$\min_{w} \max_{\theta:\, \sum_{z \in \mathcal{Z}} D_{\theta}(\hat{y}, z) = 1} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}) \right\}. \tag{3.204}$$

Figure 3.60. The architecture of the mutual information (MI)-based fair classifier. The
prediction output $\hat{y}$ is fed into the discriminator, wherein the goal is to figure out the
sensitive attribute $z$ from $\hat{y}$. The discriminator output $D_{\theta}(\hat{y}, z)$ can be interpreted as the
probability that $\hat{y}$ belongs to the attribute $z$. Here the softmax function is applied to
ensure the sum-up-to-one constraint.


The architecture of the fair classifier The architecture of the implementable
optimization (3.204) is illustrated in Fig. 3.60. On top of a classifier, we introduce
a new entity, called a discriminator, which corresponds to the inner optimization.
In the discriminator, we wish to find $\theta^{*}$ that maximizes $\frac{1}{m} \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)})$. On
the other hand, the classifier wants to minimize the term. The quantity $D_{\theta}(\hat{y}^{(i)}, z^{(i)})$ can
be viewed as the ability to figure out $z$ from the prediction $\hat{y}$. Notice that the classifier
wishes to minimize this ability for the purpose of fairness, while the discriminator
has the opposite goal. One natural interpretation of $D_{\theta}(\hat{y}, z)$
is that it captures the probability that $z$ is indeed the ground-truth sensitive attribute
for $\hat{y}$. Here the softmax function is applied to ensure the sum-up-to-one constraint.
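
To make the softmax constraint concrete, here is a minimal pure-Python sketch of a discriminator over a non-binary attribute. It is an illustration under assumptions made here, not the book's architecture: the discriminator is taken to be a single affine layer on the scalar prediction, logarithms are base 2, and the names `softmax`, `discriminator` and `inner_objective` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs are positive and sum to one."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator(y_hat, theta):
    """Return [D_theta(y_hat, z) for each attribute value z].

    theta is a list of (weight, bias) pairs, one per attribute value
    (an assumed one-layer parameterization).
    """
    return softmax([w * y_hat + b for w, b in theta])


def inner_objective(y_hats, zs, theta):
    """(1/m) * sum_i log2 D_theta(y_hat_i, z_i), the discriminator's objective."""
    total = sum(math.log2(discriminator(y, theta)[z]) for y, z in zip(y_hats, zs))
    return total / len(y_hats)


# Toy predictions and attribute values in {0, 1, 2} (made up for illustration).
y_hats = [0.1, 0.5, 0.9]
zs = [0, 1, 2]
theta = [(-2.0, 1.0), (0.0, 0.0), (2.0, -1.0)]

print(discriminator(0.9, theta))
print(inner_objective(y_hats, zs, theta))
```

A discriminator that ignores the prediction (all weights and biases zero) outputs the uniform distribution, so its objective equals $-\log_2 |\mathcal{Z}|$; if $Z$ is uniform, adding $H(Z)$ then gives a zero mutual-information estimate, which matches the intuition that nothing about $z$ has been extracted.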


Analogy with GANs Since the classifier and the discriminator are competing,
one can make an analogy with GANs, in which the generator and the discriminator
also compete like a two-player game. While the fair classifier and the GAN bear
strong similarity in their nature, these two are distinct in their roles. See Fig. 3.61
for the detailed distinctions.

MI-based fair classifier | GAN
Discriminator. Goal: figure out the sensitive attribute from the prediction | Discriminator. Goal: distinguish real samples from fake ones
Classifier. Goal: maximize prediction accuracy | Generator. Goal: generate realistic fake samples

Figure 3.61. MI-based fair classifier vs. GAN: both bear similarity in structure (as
illustrated in Fig. 3.60), yet distinctions in role.


Look ahead The optimization formulation of a fair classifier has been covered
in this section. In the upcoming section, we will examine a method to solve the
optimization problem (3.204) and how to implement it in TensorFlow.

3.20 Fair Machine Learning: TensorFlow Implementation


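Recap Previously we formulated an optimization that respects two fairness
constraints: disparate treatment (DT) and disparate impact (DI). Given $m$ example
triplets $\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m}$:

$$\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y})$$

where $\hat{y}^{(i)}$ indicates the classifier output, depending only on $x^{(i)}$ (not on the sensitive
attribute $z^{(i)}$, due to the DT constraint); and $\lambda$ is a regularization factor that balances
prediction accuracy against the DI constraint, quantified as $I(Z; \hat{Y})$. Using the
connection between mutual information and KL divergence, we could approximate
$I(Z; \hat{Y})$ in the form of an optimization:

$$I(Z; \hat{Y}) \approx H(Z) + \max_{D(\hat{y}, z):\, \sum_{z \in \mathcal{Z}} D(\hat{y}, z) = 1} \frac{1}{m} \sum_{i=1}^{m} \log D(\hat{y}^{(i)}, z^{(i)}). \tag{3.205}$$

We then parameterized $D(\cdot)$ with $\theta$ to obtain:

$$\min_{w} \max_{\theta:\, \sum_{z \in \mathcal{Z}} D_{\theta}(\hat{y}, z) = 1} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}) \right\}. \tag{3.206}$$

Two questions that arise are: (i) how to solve the optimization (3.206)?; and (ii) how
to implement it via TensorFlow?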


Outline In this section, we will tackle the two questions. We will do four
things. Firstly, we will explore a practical algorithm to tackle the
optimization (3.206). Secondly, we will conduct a case study, focusing on recidivism
prediction, to exercise the algorithm. Thirdly, we will emphasize a specific
implementation detail: synthesizing an unfair dataset. Lastly, we will discuss how to
implement the algorithm using TensorFlow. We will focus on a binary sensitive
attribute setting for illustrative purposes.


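Observation Let us begin by translating the optimization (3.206) into a version
that is more friendly for programming:

$$\begin{aligned}
&\min_{w} \max_{\theta:\, \sum_{z \in \mathcal{Z}} D_{\theta}(\hat{y}, z) = 1} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}) \right\} \\
&\overset{(a)}{=} \min_{w} \max_{\theta} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \left[ \sum_{i: z^{(i)} = 1} \log D_{\theta}(\hat{y}^{(i)}) + \sum_{i: z^{(i)} = 0} \log\left(1 - D_{\theta}(\hat{y}^{(i)})\right) \right] \right\} \\
&\overset{(b)}{=} \min_{w} \max_{\theta} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \left[ z^{(i)} \log D_{\theta}(\hat{y}^{(i)}) + (1 - z^{(i)}) \log\left(1 - D_{\theta}(\hat{y}^{(i)})\right) \right] \right\} \\
&\overset{(c)}{=} \min_{w} \max_{\theta} \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) - \lambda \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(\hat{y}^{(i)})) \right\} \\
&\overset{(d)}{=} \min_{w} \max_{\theta} \underbrace{\frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, G_{w}(x^{(i)})) - \lambda \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))) \right\}}_{=: J(w, \theta)}
\end{aligned}$$

where (a) is because we consider a binary sensitive attribute setting and we denote
$D_{\theta}(\hat{y}^{(i)}, 1)$ simply by $D_{\theta}(\hat{y}^{(i)})$; (b) is due to $z^{(i)} \in \{0, 1\}$; (c) follows from the
definition of the binary cross entropy loss $\ell_{\mathrm{CE}}(\cdot, \cdot)$; and (d) comes from
$G_{w}(x^{(i)}) := \hat{y}^{(i)}$.

Notice that $J(w, \theta)$ contains two cross entropy loss terms, each being a
non-trivial function of $G_{w}(\cdot)$ and/or $D_{\theta}(\cdot)$. Hence, in general, $J(w, \theta)$ is highly
non-convex in $w$ and non-concave in $\theta$.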


Alternating gradient descent Similar to the prior GAN setting in
Section 3.17, what we can do is to apply the only technique that we are aware of:
alternating gradient descent. And then hope for the best. We employ $k$:1
alternating gradient descent:

1. Update the classifier (generator)'s weight:
$w^{(t+1)} \leftarrow w^{(t)} - \alpha_1 \nabla_{w} J(w^{(t)}, \theta^{(t)})$.
2. Update the discriminator's weight $k$ times while fixing $w^{(t+1)}$: starting from
$\theta = \theta^{(t)}$, repeat for $i = 1, \dots, k$:
$\theta \leftarrow \theta + \alpha_2 \nabla_{\theta} J(w^{(t+1)}, \theta)$,
and denote the result by $\theta^{(t+1)}$.
3. Repeat the above.

Similar to the GAN setting, one can use the Adam optimizer, possibly together with
the batch version of the algorithm.
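
The update pattern in the steps above can be sketched on a toy problem. The following is a minimal illustration, not the fair-classifier training itself: the objective $J(w, \theta) = w^2 + w\theta - \theta^2$ is an assumption chosen here because it is convex in $w$ and concave in $\theta$ with a saddle point at the origin, and the function name `alternating_gradient_descent`, the learning rates and $k = 2$ are hypothetical choices.

```python
def alternating_gradient_descent(grad_w, grad_theta, w, theta,
                                 lr_w=0.1, lr_theta=0.1, k=2, iterations=200):
    """k:1 alternating gradient descent for min over w, max over theta."""
    for _ in range(iterations):
        # Step 1: one descent step on w with theta held fixed.
        w = w - lr_w * grad_w(w, theta)
        # Step 2: k ascent steps on theta with the updated w held fixed.
        for _ in range(k):
            theta = theta + lr_theta * grad_theta(w, theta)
    return w, theta


# Toy objective (assumed for illustration): J(w, theta) = w^2 + w*theta - theta^2.
def grad_w(w, theta):
    return 2 * w + theta      # dJ/dw


def grad_theta(w, theta):
    return w - 2 * theta      # dJ/dtheta


w_final, theta_final = alternating_gradient_descent(grad_w, grad_theta,
                                                    w=1.0, theta=-1.0)
print(w_final, theta_final)
```

On this well-behaved toy objective the iterates approach the saddle point at the origin; for the non-convex, non-concave $J(w, \theta)$ of the fair classifier no such guarantee is available, which is why the text speaks of hoping for the best.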


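Optimization used in our experiments Here is the optimization that we will
use in our experiments:

$$\min_{w} \max_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left\{ (1 - \lambda) \ell_{\mathrm{CE}}(y^{(i)}, G_{w}(x^{(i)})) - \lambda \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))) \right\}. \tag{3.207}$$

In order to restrict the range of $\lambda$ to $0 \leq \lambda \leq 1$, we apply the $(1 - \lambda)$ factor to
the loss term w.r.t. prediction accuracy.

Like the prior GAN setting, we define two loss terms. One is the “classifier (or
generator) loss”:

$$\min_{w} \max_{\theta} \underbrace{\frac{1}{m} \sum_{i=1}^{m} \left\{ (1 - \lambda) \ell_{\mathrm{CE}}(y^{(i)}, G_{w}(x^{(i)})) - \lambda \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))) \right\}}_{\text{“classifier (generator) loss”}}.$$

Given $w$, the discriminator wishes to maximize:

$$\max_{\theta} \; -\frac{\lambda}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))).$$

This is equivalent to minimizing the negative of the objective:

$$\min_{\theta} \underbrace{\frac{\lambda}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)})))}_{\text{“discriminator loss”}}. \tag{3.208}$$

This is how we define the “discriminator loss”.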
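
The two loss terms can be written out directly from their definitions. The sketch below is a plain-Python illustration under assumptions made here: it uses natural logarithms (as deep learning libraries typically do, which only rescales the losses), it takes the classifier and discriminator outputs as precomputed lists, and the names `bce`, `classifier_loss` and `discriminator_loss` are hypothetical.

```python
import math


def bce(label, prob, eps=1e-12):
    """Binary cross entropy of one example (natural log, clipped for stability)."""
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def classifier_loss(y, z, g_out, d_out, lam):
    """(1/m) * sum of (1 - lam) * CE(y, G) - lam * CE(z, D(G)), as in (3.207)."""
    terms = [(1 - lam) * bce(yi, gi) - lam * bce(zi, di)
             for yi, zi, gi, di in zip(y, z, g_out, d_out)]
    return sum(terms) / len(terms)


def discriminator_loss(z, d_out, lam):
    """(lam/m) * sum of CE(z, D(G)), as in (3.208)."""
    return lam * sum(bce(zi, di) for zi, di in zip(z, d_out)) / len(z)


# Toy values (made up): labels, attributes, classifier and discriminator outputs.
y = [1, 0, 1, 0]
z = [1, 1, 0, 0]
g_out = [0.8, 0.3, 0.6, 0.1]
d_out = [0.7, 0.4, 0.5, 0.2]

print(classifier_loss(y, z, g_out, d_out, lam=0.5))
print(discriminator_loss(z, d_out, lam=0.5))
```

Setting `lam=0` reduces the classifier loss to the ordinary average cross entropy, while `lam=1` makes the classifier care only about confusing the discriminator.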

Performance metrics We introduce a performance metric that captures the
degree of fairness. To this end, we first define the hard-decision value of the
prediction output w.r.t. a test example:

$$\tilde{Y}_{\mathrm{test}} := \mathbf{1}\{\hat{Y}_{\mathrm{test}} > 0.5\}.$$

The test accuracy is then defined as:

$$\frac{1}{m_{\mathrm{test}}} \sum_{i=1}^{m_{\mathrm{test}}} \mathbf{1}\{\tilde{y}^{(i)}_{\mathrm{test}} = y^{(i)}_{\mathrm{test}}\}$$

where $m_{\mathrm{test}}$ denotes the number of test examples. This is an empirical version of
the ground truth $P(\tilde{Y}_{\mathrm{test}} = Y_{\mathrm{test}})$.

How to define a fairness-related performance metric? Recall the mathematical
definition of DI:

$$\mathrm{DI} := \min\left(\frac{P(\tilde{Y} = 1|Z = 0)}{P(\tilde{Y} = 1|Z = 1)}, \frac{P(\tilde{Y} = 1|Z = 1)}{P(\tilde{Y} = 1|Z = 0)}\right). \tag{3.209}$$

You may wonder how to compute the two probabilities of interest: $P(\tilde{Y} = 1|Z = 0)$
and $P(\tilde{Y} = 1|Z = 1)$. Using their empirical versions together with the weak law of
large numbers (WLLN), we can estimate them. For instance,

$$P(\tilde{Y} = 1|Z = 0) = \frac{P(\tilde{Y} = 1, Z = 0)}{P(Z = 0)} \approx \frac{\sum_{i=1}^{m_{\mathrm{test}}} \mathbf{1}\{\tilde{y}^{(i)}_{\mathrm{test}} = 1, z^{(i)}_{\mathrm{test}} = 0\}}{\sum_{i=1}^{m_{\mathrm{test}}} \mathbf{1}\{z^{(i)}_{\mathrm{test}} = 0\}}$$

where the first equality is due to the definition of conditional probability and the
second approximation comes from the WLLN. The approximation becomes
more and more accurate as $m_{\mathrm{test}}$ gets larger. Similarly, we can approximate the other
probability of interest, $P(\tilde{Y} = 1|Z = 1)$. This way, we can evaluate DI (3.209).


A case study Let us exercise what we have learned with a simple example. As
a case study, we consider the same setting that we introduced earlier: recidivism
prediction, wherein the task is to predict whether an individual of interest reoffends
within two years, as illustrated in Fig. 3.62.


Synthesizing an unfair dataset In fair machine learning, we must be cautious 
about unfair datasets. To simplify matters, we will use a synthetic dataset instead of 
a real-world dataset. Although there is a real-world dataset for recidivism prediction 
called COMPAS (Angwin et al., 2020), it contains many attributes, making it more 
complicated. Therefore, we will use a specific yet simple approach to synthesize a 
much simpler unfair dataset. 


Figure 3.62. Predicting a recidivism score $\hat{y}$ from $x = (x_1, x_2)$. Here $x_1$ indicates the number
of prior criminal records; $x_2$ denotes a criminal type: misdemeanor or felony; and $z$ is a
race type among white ($z = 0$) and black ($z = 1$). The predictor is trained on
$\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m}$.

Figure 3.63. Visualization of a historically biased dataset: A hollow (or black-colored
solid) circle indicates a data point of an individual with white (or black) race; the red (or
blue) colored edge denotes the $y = 1$ reoffending (or $y = 0$ non-reoffending) label.


Let us revisit the unfair data scenario visualization that we examined in 
Section 3.18, which will serve as the basis for our synthetic dataset (as explained 
in the sequel). The visualization is shown in Fig. 3.63, where a hollow (or black- 
colored solid) circle represents a data point corresponding to an individual of white 
(or black) race, and the red (or blue) colored edge (ring) denotes the event of the 
interested individual reoffending (or not reoffending) within two years. This sce- 
nario is inherently unfair: for y = 1, there are more black-colored solid circles than 
hollow circles, and conversely for y = 0, there are more hollow circles than solid 
circles. 

To generate such an unfair dataset, we employ a simple method; see Fig. 3.64 for
an illustration. We first generate $m$ labels $y^{(i)}$'s so that they are i.i.d., each
distributed according to $\mathsf{Bern}(\frac{1}{2})$. For indices of positive examples ($y^{(i)} = 1$), we then
generate i.i.d. $x^{(i)}$'s according to $\mathcal{N}((1, 1), 0.5^2 I)$; and i.i.d. $z^{(i)}$'s as per $\mathsf{Bern}(0.8)$,
meaning that 80% are blacks ($z = 1$) and 20% are whites ($z = 0$) among the
positive individuals. Notice that the generation of $x^{(i)}$'s is not quite realistic. The
first and second components in $x^{(i)}$ do not precisely capture the number of priors
and a criminal type. You can view this generation as a crude abstraction
of the realistic data. On the other hand, for negative examples ($y^{(i)} = 0$), we
generate i.i.d. $(x^{(i)}, z^{(i)})$'s with different distributions: $x^{(i)} \sim \mathcal{N}((-1, -1), 0.5^2 I)$
and $z^{(i)} \sim \mathsf{Bern}(0.2)$, meaning that 20% are blacks ($z = 1$) and 80% are whites
($z = 0$). This way, $z^{(i)} \sim \mathsf{Bern}(\frac{1}{2})$. This is because

$$P(Z = 1) \overset{(a)}{=} P(Y = 1) P(Z = 1 | Y = 1) + P(Y = 0) P(Z = 1 | Y = 0) \overset{(b)}{=} \frac{1}{2} \cdot 0.8 + \frac{1}{2} \cdot 0.2 = \frac{1}{2}$$

Figure 3.64. A simple way to synthesize an unfair dataset. Labels are drawn i.i.d. as
$y^{(i)} \sim \mathsf{Bern}(0.5)$. If $y^{(i)} = 1$: $x^{(i)} \sim \mathcal{N}((1,1), 0.5^2 I)$, and $z^{(i)} = 1$ (black) w.p. 0.8
or $z^{(i)} = 0$ (white) w.p. 0.2. If $y^{(i)} = 0$: $x^{(i)} \sim \mathcal{N}((-1,-1), 0.5^2 I)$. The axes of the
plot are $x_1$ and $x_2$.

Figure 3.65. The architecture of the MI-based fair classifier.


where (a) follows from the total probability law and the definition of conditional
probability; and (b) is due to the rule of the data generation method employed.
Here $Z$ and $Y$ denote generic random variables for $z^{(i)}$ and $y^{(i)}$, respectively.
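The calculation $P(Z = 1) = \frac{1}{2}$ can also be checked empirically. The following is a minimal sketch, assuming NumPy's `default_rng` generator, an arbitrary seed, and an arbitrary sample size; it draws labels and sensitive attributes according to the generation rule above and prints the empirical frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
m = 200_000                     # sample size chosen arbitrarily

# labels: y ~ Bern(0.5)
y = rng.binomial(1, 0.5, size=m)

# sensitive attribute: z | y=1 ~ Bern(0.8), z | y=0 ~ Bern(0.2)
p_z_given_y = np.where(y == 1, 0.8, 0.2)
z = rng.binomial(1, p_z_given_y)

print("empirical P(Z=1):      ", z.mean())
print("empirical P(Z=1 | Y=1):", z[y == 1].mean())
print("empirical P(Z=1 | Y=0):", z[y == 0].mean())
```

By the WLLN, the first printed value should be close to 0.5, and the two conditional frequencies close to 0.8 and 0.2.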


Model architecture Fig. 3.65 illustrates the architecture of the MI-based fair
classifier. Since we focus on the binary sensitive attribute, the discriminator yields
a single output $D_\theta(\hat{y})$. For models of the classifier and discriminator, we employ
simple single-layer neural networks with logistic activation in the output layer; see
Fig. 3.66.
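To make "single-layer neural network with logistic activation" concrete, here is a minimal NumPy sketch of the two forward passes. The function names, the parameter names (w, b, theta, c), and the numerical values are hypothetical and chosen only for illustration; the experiments themselves use Keras models.

```python
import numpy as np


def sigmoid(a):
    # logistic activation
    return 1.0 / (1.0 + np.exp(-a))


def classifier_forward(x, w, b):
    # x: (m, 2) features, w: (2,) weights, b: scalar bias
    # returns the soft prediction G_w(x) in (0, 1) for each example
    return sigmoid(x @ w + b)


def discriminator_forward(y_hat, theta, c):
    # y_hat: (m,) classifier outputs, theta and c: scalar weight and bias
    # returns D_theta(y_hat), the discriminator's soft guess of z
    return sigmoid(theta * y_hat + c)


x = np.array([[1.0, 1.0], [-1.0, -1.0]])
w = np.array([1.0, 1.0])
b = 0.0

y_hat = classifier_forward(x, w, b)
d_out = discriminator_forward(y_hat, theta=2.0, c=-1.0)
print(y_hat, d_out)
```

The classifier maps a two-dimensional feature to one number, and the discriminator maps that single number to one number, which matches the single-output design described above.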


Figure 3.66. Models for (a) the classifier and (b) the discriminator.

TensorFlow: Synthesizing an unfair dataset First consider the synthesis of
an unfair dataset. To generate i.i.d. Bernoulli random variables for labels, we use:

import numpy as np
y_train = np.random.binomial(1,0.5,size=(train_size,))

where the first two arguments (1,0.5) specify Bern(0.5); and the empty slot
after train_size in the tuple (train_size,) indicates a single dimension. Remember
we generate i.i.d. Gaussian random variables for $x^{(i)}$'s. To this end, one can use:

x = np.random.normal(loc=(1,1),scale=0.5, size=(train_size,2))


TensorFlow: Optimizers for classifier & discriminator For the classifier, we use
the Adam optimizer with the learning rate of 0.005 and $(\beta_1, \beta_2) = (0.9, 0.999)$.
For the discriminator, we use another simpler optimizer, named Stochastic Gradient
Descent, SGD for short. SGD is the naive gradient descent yet with a batch size
of 1. We use SGD with the learning rate of 0.005.

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import SGD

adam=Adam(learning_rate=0.005,beta_1=0.9, beta_2=0.999)
sgd=SGD(learning_rate=0.005)


TensorFlow: Classifier (generator) loss Recall the optimization problem of
interest:

$$\min_w \max_\theta \frac{1}{m} \left\{ \sum_{i=1}^{m} (1-\lambda)\, \ell_{\text{CE}}(y^{(i)}, G_w(x^{(i)})) - \lambda \sum_{i=1}^{m} \ell_{\text{CE}}(z^{(i)}, D_\theta(G_w(x^{(i)}))) \right\}.$$

To implement the classifier loss (the objective in the above), we use:

from tensorflow.keras.losses import BinaryCrossentropy
CE_loss = BinaryCrossentropy(from_logits=False)
# Keras loss objects are called as loss(y_true, y_pred)
p_loss = CE_loss(y_train,y_pred)
f_loss = CE_loss(z_train,discriminator(y_pred))
c_loss = (1-lamb)*p_loss - lamb*f_loss

where y_pred indicates the classifier output; y_train denotes a label; and z_train is a
binary sensitive attribute.
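To see how the two cross-entropy terms combine, the sketch below recomputes the classifier loss by hand with NumPy on a tiny made-up batch. The helper `bce`, the four example values, and $\lambda = 0.1$ are assumptions for illustration only; `bce` mimics a binary cross entropy averaged over the batch.

```python
import numpy as np


def bce(targets, probs, eps=1e-7):
    # binary cross entropy averaged over the batch;
    # probabilities are clipped to avoid log(0)
    probs = np.clip(probs, eps, 1.0 - eps)
    losses = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(np.mean(losses))


y = np.array([1.0, 0.0, 1.0, 0.0])       # labels (made up)
z = np.array([1.0, 0.0, 0.0, 1.0])       # sensitive attributes (made up)
g_out = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical classifier outputs G_w(x)
d_out = np.array([0.6, 0.5, 0.5, 0.4])   # hypothetical discriminator outputs D_theta(G_w(x))
lamb = 0.1

p_loss = bce(y, g_out)   # prediction term
f_loss = bce(z, d_out)   # fairness term: how well the discriminator guesses z
c_loss = (1 - lamb) * p_loss - lamb * f_loss
print(p_loss, f_loss, c_loss)
```

Note the sign: the classifier loss decreases when the discriminator's cross entropy `f_loss` increases, so the classifier is rewarded for producing outputs from which the sensitive attribute is hard to guess.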


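TensorFlow: Discriminator loss Recall the discriminator loss that we defined
in (3.208):

$$\min_\theta \frac{\lambda}{m} \sum_{i=1}^{m} \ell_{\text{CE}}(z^{(i)}, D_\theta(G_w(x^{(i)}))).$$

To implement this, we use:

# Keras loss objects are called as loss(y_true, y_pred)
f_loss = CE_loss(z_train,discriminator(y_pred))
d_loss = lamb*f_loss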


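TensorFlow: Evaluation Recall the DI performance:

$$\mathsf{DI} := \min\left( \frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}, \frac{P(\tilde{Y} = 1 | Z = 1)}{P(\tilde{Y} = 1 | Z = 0)} \right).$$

To evaluate the DI performance, we rely on the following approximation:

$$P(\tilde{Y} = 1 | Z = 0) \approx \frac{\sum_{i=1}^{m_{\text{test}}} \mathbf{1}\{\tilde{y}_{\text{test}}^{(i)} = 1, z_{\text{test}}^{(i)} = 0\}}{\sum_{i=1}^{m_{\text{test}}} \mathbf{1}\{z_{\text{test}}^{(i)} = 0\}}.$$

Here is how to implement this in detail:

import numpy as np
# hard decisions; y_pred is assumed to be a NumPy array of classifier outputs
y_tilde = (y_pred>0.5).astype(int).squeeze()
# indicators and counts of the two groups
z0_ind = (z_train == 0.0)
z1_ind = (z_train == 1.0)
z0_sum = int(np.sum(z0_ind))
z1_sum = int(np.sum(z1_ind))
# empirical conditional probabilities
P_y1_z0 = float(np.sum((y_tilde==1)[z0_ind]))/z0_sum
P_y1_z1 = float(np.sum((y_tilde==1)[z1_ind]))/z1_sum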
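The two empirical probabilities still have to be combined into DI as in (3.209). A minimal sketch that wraps the steps into one helper is given below; the function name `fairness_metrics` and the eight-example toy data are hypothetical, and the sketch assumes that each group contains at least one example predicted as positive (otherwise a ratio is undefined).

```python
import numpy as np


def fairness_metrics(y_prob, y_true, z):
    """Return (accuracy, DI) computed from soft predictions y_prob,
    binary labels y_true, and binary sensitive attributes z."""
    y_tilde = (np.asarray(y_prob).squeeze() > 0.5).astype(int)  # hard decisions
    y_true = np.asarray(y_true).astype(int)
    z = np.asarray(z).astype(int)

    accuracy = float(np.mean(y_tilde == y_true))

    # empirical versions of P(Y~=1 | Z=0) and P(Y~=1 | Z=1)
    p_y1_z0 = float(np.mean(y_tilde[z == 0] == 1))
    p_y1_z1 = float(np.mean(y_tilde[z == 1] == 1))

    # DI as in (3.209); assumes both probabilities are nonzero
    di = min(p_y1_z0 / p_y1_z1, p_y1_z1 / p_y1_z0)
    return accuracy, di


y_prob = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4]
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
z = [1, 1, 1, 1, 0, 0, 0, 0]

accuracy, di = fairness_metrics(y_prob, y_true, z)
print(accuracy, di)
```

In this toy data the predictions are perfectly accurate, yet three of four individuals with $z = 1$ are predicted positive against one of four with $z = 0$, so DI is $1/3$: high accuracy does not by itself imply fairness.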


Closing In Part I, we have explored three crucial notions in information theory, 
namely entropy, mutual information, and KL divergence. These notions play a sig- 
nificant role in representing the fundamental limit on the compression rate of an 
information source and proving the associated theorem, the source coding theorem.
Part II of the book has focused on investigating the fascinating phenomenon of 
the maximum transmission rate, phase transition, and highlighting the elegance of 
information theory, which deals with the laws governing the flow of information, 
much like physics deals with the laws governing the behavior of the physical
universe. In addition, we have discovered the critical role of mutual information in
defining the sharp threshold on the maximum transmission rate, known as channel
capacity, as established in the channel coding theorem. 

In Part III, we have showcased the modern applications of information theory 
in data science, emphasizing two main storylines. The first storyline focuses on 
the information theory of various systems that are of interest in data science, such 
as social networks, biological networks, and ranking systems. In this context, we 
have observed the occurrence of a phase transition in the amount of information 
required to perform various tasks, including community detection in social networks,
haplotype phasing in computational biology, and top-K ranking in search
engines. To prove the achievability and converse of information-theoretic limits, 
we have utilized several powerful tools of information theory, including the union 
bound, MAP decoding, maximum likelihood decoding, Chernoff bound, Fano’s 
inequality, and data processing inequality. The second storyline deals with the roles 
of information-theoretic notions in machine learning and deep learning. Specifi- 
cally, we have explored the core role of cross entropy in designing a loss function 
for supervised learning, the fundamental role of KL divergence in the design of a 
powerful unsupervised learning framework known as GAN, and the recently dis- 
covered role of mutual information in the development of fair machine learning 
algorithms. 

The topics discussed in this book encompass a range of classical and modern 
concepts in information theory. However, we acknowledge that there are still many 
other topics that we have not covered. Our approach has been to emphasize the 
development of logical and critical thinking skills, which we believe is more impor- 
tant than simply covering a wide range of topics. Information theory provides pow- 
erful principles and tools that have been successfully applied in various fields by 
many researchers. Although this book focuses on applications in data science, we 
believe that the principles discussed here have much broader applicability. We hope 
that you will find these principles and tools useful for your own purposes.

Problem Set 12


Prob 12.1 (Equalized Odds) In Section 3.18, we studied two fairness concepts:
(i) disparate treatment; and (ii) disparate impact. In this problem, we explore
another fairness notion that arises in the field: Equalized Odds (EO for short). Let
$Z \in \mathcal{Z}$ be a sensitive attribute. Let $Y$ and $\hat{Y}$ be the ground-truth label and its
prediction.

(a) For illustrative purposes, consider a simple setting where $Z$ and $Y$ are binary.
Let $\tilde{Y} = \mathbf{1}\{\hat{Y} > 0.5\}$. The mathematical definition of the EO under this
setting is:

$$\mathsf{EO} := \min_{y \in \{0,1\}} \min_{z \in \{0,1\}} \frac{P(\tilde{Y} = 1 | Y = y, Z = 1 - z)}{P(\tilde{Y} = 1 | Y = y, Z = z)}. \tag{3.210}$$

Show that $I(Z; \tilde{Y} | Y) = 0$ implies EO = 1.

(b) Suppose that $Z$ and $Y$ are not necessarily binary. The relationship between
conditional mutual information and the KL divergence is:

$$I(Z; \hat{Y} | Y) = \mathsf{KL}(P_{\hat{Y},Z|Y} \,\|\, P_{\hat{Y}|Y} P_{Z|Y})$$

where $P_{\hat{Y},Z|Y}$, $P_{\hat{Y}|Y}$ and $P_{Z|Y}$ indicate the conditional probability of
$(\hat{Y}, Z)$, $\hat{Y}$, and $Z$, respectively, conditioned on $Y$. Using this definition,
show that

$$I(Z; \hat{Y} | Y) = \sum_{y \in \mathcal{Y},\, \hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z,Y}(\hat{y}, z, y) \log \frac{P_{\hat{Y},Z|y}(\hat{y}, z)}{P_{\hat{Y}|y}(\hat{y})} + H(Z|Y) \tag{3.211}$$

where $P_{\hat{Y},Z,Y}$ indicates the joint distribution of $(\hat{Y}, Z, Y)$; and $P_{\hat{Y},Z|y}$ and
$P_{\hat{Y}|y}$ denote the conditional distributions of $(\hat{Y}, Z)$ and $\hat{Y}$, respectively,
conditioned on $Y = y$.

(c) Show that

$$I(Z; \hat{Y} | Y) = H(Z|Y) + \max_{D(\hat{y},z,y):\, \sum_{z \in \mathcal{Z}} D(\hat{y},z,y) = 1} \;\sum_{y \in \mathcal{Y},\, \hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y},Z,Y}(\hat{y}, z, y) \log D(\hat{y}, z, y). \tag{3.212}$$

(d) Explain the rationale behind the following approximation:

$$I(Z; \hat{Y} | Y) \approx H(Z|Y) + \max_{D(\hat{y},z,y):\, \sum_{z \in \mathcal{Z}} D(\hat{y},z,y) = 1} \frac{1}{m} \sum_{i=1}^{m} \log D(\hat{y}^{(i)}, z^{(i)}, y^{(i)}). \tag{3.213}$$

(e) Formulate an optimization for a fair classifier that attempts to minimize
both the prediction loss and the approximated $I(Z; \hat{Y} | Y)$ (3.213). Use
the notation $\lambda$ for a regularization factor that balances prediction accuracy
against the quantified fairness constraint. Also draw the classifier-&-discriminator
architecture which represents the formulated optimization.


Prob 12.2 (A variant of the MI-based fair classifier) Let $Z \in \{0, 1\}$ be a
binary sensitive attribute. Let $Y$ and $\hat{Y}$ be the ground-truth label and its prediction
of a classifier. Let $\tilde{Y} = \mathbf{1}\{\hat{Y} > 0.5\}$.

(a) Show that $I(Z; \tilde{Y}) = 0$ is a necessary and sufficient condition for DI = 1
where

$$\mathsf{DI} := \min\left( \frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}, \frac{P(\tilde{Y} = 1 | Z = 1)}{P(\tilde{Y} = 1 | Z = 0)} \right).$$

(b) Approximate $I(Z; \tilde{Y})$ similarly to the formula claimed in part (d) of
Prob 12.1. Also explain the rationale behind the approximation.

(c) Formulate an optimization for a fair classifier that attempts to minimize
both the prediction loss and the approximated $I(Z; \tilde{Y})$, derived in the
prior part. Use the notation $\lambda$ for a regularization factor that balances
prediction accuracy against the fairness constraint. Also draw the classifier-&-discriminator
architecture which respects the formulated optimization.


Prob 12.3 (TensorFlow implementation of the MI-based fair classifier)
Consider the MI-based fair classifier in Sections 3.19 and 3.20. In this problem, you
are asked to build a simple fair classifier that predicts recidivism scores of individuals
with prior criminal records. See Fig. 3.67. We employ very simple single-layer
neural networks for classifier (generator) and discriminator with logistic activation
at the output layer.

Figure 3.67. Predicting a recidivism score $\hat{y}$ from $x = (x_1, x_2)$. Here $x_1$ indicates the number
of prior criminal records; $x_2$ denotes a criminal type: misdemeanor or felony; and $z$ is a
race type among white ($z = 0$) and black ($z = 1$).

(a) (Unfair dataset synthesis) Explain how an unfair dataset is generated in the
following code:


import numpy as np

n_samples = 2000
p = 0.8
# numbers of positive and negative examples
n_Y1 = int(n_samples*0.5)
n_Y0 = n_samples - n_Y1
# generate positive samples
Y1 = np.ones(n_Y1)
X1 = np.random.normal(loc=[1,1],scale=0.5,
                      size=(n_Y1,2))
Z1 = np.random.binomial(1,p,size=(n_Y1,))
# generate negative samples
Y0 = np.zeros(n_Y0)
X0 = np.random.normal(loc=[-1,-1],scale=0.5,
                      size=(n_Y0,2))
Z0 = np.random.binomial(1,1-p,size=(n_Y0,))
# merge
Y = np.concatenate((Y1,Y0))
X = np.concatenate((X1,X0))
Z = np.concatenate((Z1,Z0))
Y = Y.astype(np.float32)
X = X.astype(np.float32)
Z = Z.astype(np.float32)
# shuffle and split into train & test data
shuffle = np.random.permutation(n_samples)
X_train = X[shuffle][:int(n_samples*0.8)]
Y_train = Y[shuffle][:int(n_samples*0.8)]
Z_train = Z[shuffle][:int(n_samples*0.8)]
X_test = X[shuffle][int(n_samples*0.8):]
Y_test = Y[shuffle][int(n_samples*0.8):]
Z_test = Z[shuffle][int(n_samples*0.8):]

(b) (Data visualization) Using the following code or otherwise, plot randomly
sampled data points (say 200 random points) among the entire data points
generated in part (a).

import matplotlib.pyplot as plt
# select n_s samples (the training data are already shuffled)
n_s = 200
Xs = X_train[:n_s]
Ys = Y_train[:n_s]
Zs = Z_train[:n_s]
# choose part of X and Y associated with a certain Z
X_Z0 = Xs[Zs==0.0]
X_Z1 = Xs[Zs==1.0]
Y_Z0 = Ys[Zs==0.0]
Y_Z1 = Ys[Zs==1.0]
# plot
plt.figure(figsize=(14,10))
plt.scatter(
    X_Z0[Y_Z0==1.0][:,0], X_Z0[Y_Z0==1.0][:,1],
    color='red',marker='o',facecolors='none',
    s=120, linewidth=1.5, label='White reoffend')
plt.scatter(
    X_Z0[Y_Z0==0.0][:,0], X_Z0[Y_Z0==0.0][:,1],
    color='blue',marker='o',facecolors='none',
    s=120, linewidth=1.5, label='White non-reoffend')
plt.scatter(
    X_Z1[Y_Z1==1.0][:,0], X_Z1[Y_Z1==1.0][:,1],
    color='red',marker='o',facecolors='black',
    s=120, linewidth=1.5, label='Black reoffend')
plt.scatter(
    X_Z1[Y_Z1==0.0][:,0], X_Z1[Y_Z1==0.0][:,1],
    color='blue',marker='o',facecolors='black',
    s=120, linewidth=1.5, label='Black non-reoffend')
plt.legend(fontsize=16)


(c) (Classifier & discriminator) Draw block diagrams of the classifier and the
discriminator implemented by the following code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

classifier=Sequential()
classifier.add(Dense(1,input_dim=2, activation='sigmoid'))
discriminator=Sequential()
discriminator.add(Dense(1,input_dim=1, activation='sigmoid'))

(d) (Optimizers and loss functions) Explain how the optimizers and loss
functions of the discriminator and the classifier are implemented in the following
code. Also draw a block diagram of the GAN model implemented under
the name gan.

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.layers import Concatenate

# optimizers of classifier & discriminator
c_opt=Adam(learning_rate=0.005,beta_1=0.9,beta_2=0.999)
d_opt=SGD(learning_rate=0.005)

# define discriminator loss
def d_loss(y_true,y_pred):
    CE_loss = BinaryCrossentropy(from_logits=False)
    lamb = 0.1
    # Keras loss objects are called as loss(y_true, y_pred)
    return lamb*CE_loss(y_true,y_pred)
# discriminator compile
discriminator.compile(loss=d_loss, optimizer=d_opt)

# define classifier (generator) loss
def c_loss(y_true,y_pred):
    # y_true[:,0]: Y_train (label)
    # y_true[:,1]: Z_train (sensitive attribute)
    # y_pred[:,0]: classifier output G(x)
    # y_pred[:,1]: discriminator output fed by
    #              classifier output D(G(x))
    CE_loss = BinaryCrossentropy(from_logits=False)
    lamb = 0.1
    p_loss = CE_loss(y_true[:,0],y_pred[:,0])
    f_loss = CE_loss(y_true[:,1],y_pred[:,1])
    return (1-lamb)*p_loss - lamb*f_loss

# define the GAN model
# input: x
# output: [G(x), D(G(x))]
discriminator.trainable = False
gan_input = Input(shape=(2,))
Gx = classifier(gan_input)
DGx = discriminator(Gx)
output = Concatenate()([Gx,DGx])
gan = Model(gan_input, output)
# The GAN model compile
gan.compile(loss=c_loss, optimizer=c_opt)


(e) (Training) Explain how classifier and discriminator are trained in the
following code:

import numpy as np

EPOCHS = 400
k = 2 # k:1 alternating gradient descent
c_losses = []
d_losses = []

for epoch in range(1,EPOCHS+1):
    #####################
    # train discriminator
    #####################
    # input for discriminator
    d_input=classifier.predict(X_train)
    # label for discriminator
    d_label=Z_train
    # train discriminator
    d_loss=discriminator.train_on_batch(d_input,d_label)

    #####################
    # train classifier
    #####################
    if epoch % k == 0: # train once every k steps
        # label for classifier
        # 1st component: Y_train
        # 2nd component: Z_train (sensitive attribute)
        c_label = np.zeros((len(Y_train),2))
        c_label[:,0] = Y_train
        c_label[:,1] = Z_train
        # train classifier
        c_loss = gan.train_on_batch(X_train,c_label)
        # record the classifier loss only when it has been updated
        c_losses.append(c_loss)

    d_losses.append(d_loss)


(f) (Evaluation) Suppose we train classifier and discriminator using the code
in part (e) with EPOCHS=400. Plot the tradeoff performance between test
accuracy and DI by sweeping $\lambda$ from 0 to 1. Also include the Python script.


DOI: 10.1561/9781638281153.ch4

Python Basics


A.1 Jupyter Notebook 


Outline To use Python, you will need to have another software platform, known 
as Jupyter notebook, installed on your system. In this section, we will cover some 
basic concepts related to Jupyter notebook. We will cover four parts in detail. 
First, we will explore the role of Jupyter notebook in light of Python. Next, we 
will provide guidance on how to install the software and launch a file for scripting 
a code. We will also examine some useful interfaces that simplify the process of 
scripting Python code. Finally, we will introduce several frequently-used shortcuts 
for writing and executing code. 


What is Jupyter notebook? Jupyter notebook is a powerful tool that allows 
you to write and run Python code. One of its key advantages is that you can execute 
each line of code individually, rather than running the entire code all at once. This 
feature makes it easy to debug your code, particularly when dealing with lengthy 
programs. 

a = 1
b = 2
a + b

3

Figure A.1. Three versions of Anaconda installers (Windows, MacOS, and Linux).


There are two common methods for using Jupyter notebook. The first involves 
running the code on a server or in the cloud, while the second involves using a local 
machine. In this section, we will focus on the latter approach. 


Install & launch To use Jupyter notebook on a local machine, you need to install 
a software tool called Anaconda. The latest version can be downloaded from 


https://www.anaconda.com/products/individual 


There are three different versions available for different operating systems, as 
shown in Fig. A.1. During installation, you may encounter errors related to non- 
ASCII characters in the destination folder path or permission to access the path. 
To resolve these issues, ensure that the folder path does not include non-ASCII 
characters and run the installer under “run as administrator” mode. 

To launch Jupyter notebook, you can use the Anaconda prompt for Win- 
dows or the terminal for Mac and Linux. Simply type “jupyter notebook” in the 
prompt and press Enter. The Jupyter notebook window will open automatically. 
If it does not appear, you can manually open it by copying and pasting the URL 
indicated by the arrow in Fig. A.2 into your web browser. Once properly launched, 
the Jupyter notebook window should look like Fig. A.3. 

Generating a new notebook file is an easy process. Initially, navigate to the folder 
where you wish to save the notebook file. Then, select the New tab on the top right 
corner (highlighted in blue), and click on the Python 3 tab (indicated in red). Refer 
to Fig. A.4 to locate the tabs.

Figure A.2. How to launch Jupyter notebook in the Anaconda prompt.

Figure A.3. Web browser of a successfully launched Jupyter notebook.

Figure A.4. How to create a Jupyter notebook file on the web browser.


Interface Jupyter notebook contains two key components required to run a 
code. The first is a computational engine which executes the code. The engine 
is named Kernel and it can be controlled via several functions in the Kernel tab. 
See Fig. A.5 for details. 

The second element is a component called a “cell,” where you can write a script. 
The cell has two modes of operation: edit mode and command mode. In edit mode, 
you can type a code script for running a program or any text like a regular text editor.

Figure A.5. Kernel is a computational engine which serves to run the code. There are
several relevant functions under the Kernel tab (e.g., Interrupt, Restart & Clear Output,
Restart & Run All, Reconnect, and Change kernel).

Figure A.6. How to choose the Code or Markdown option in the edit mode.


Code scripts are written under the Code tab, indicated by a red box in Fig. A.6, 
while text-editing is done under the Markdown tab, indicated by a blue box. In 
command mode, you can edit the notebook as a whole. This allows you to copy or 
delete cells, and move them around. 


Shortcuts There are numerous shortcuts that are very useful for editing and navi- 
gating a Jupyter notebook. We will highlight three types of shortcuts that are com- 
monly used. The first set is for changing between the edit and command modes. To 
switch from the edit to the command mode, we press the Esc key, while pressing 
Enter takes us back to the edit mode. The second set of shortcuts is for inserting or 
deleting a cell. Under the command mode, we can use the “a” shortcut to insert a 
new cell above the current cell, “b” to insert below, and “d+d” to delete the current 
cell. The final set of shortcuts is for executing a cell. We can use the arrow keys to 
move between cells, Shift + Enter to run the current cell and move to the next one, 
and Ctrl + Enter to stay in the current cell after execution.

A.2 Basic Syntaxes of Python


Outline This section will cover the basic Python syntax required to script for 
information-theoretic notions and algorithm implementation. Specifically, we will 
focus on three essential concepts: (i) class; (ii) package; and (iii) function. Addi- 
tionally, we will introduce a range of Python packages that are relevant and useful 
for the topics covered in this book. 


A.2.1 Data structure 


There are two prominent data-structure components in Python: (i) list; and (ii) 
set. 


(i) List List is a built-in data type in Python that stores
multiple elements in a single variable. The elements are kept in a specific order,
and duplicates are allowed. Below are examples of how to use it:

x = [1, 2, 3, 4] # construct a simple list
print(x)

[1, 2, 3, 4]

x.append(5) # add an item at the end
print(x)

[1, 2, 3, 4, 5]

x.pop() # delete the last item
print(x)

[1, 2, 3, 4]

# checking if a particular element exists in the list
if 3 in x:
    print(True)
if 5 in x:
    print(True)
else:
    print(False)

True
False

# A single-line construction of a list
y = [x for x in range(1,10)]
print(y)

[1, 2, 3, 4, 5, 6, 7, 8, 9]

# Retrieving all the elements through a "for" loop
for i in x:
    print(i)

1
2
3
4


(ii) Set Set is a built-in data type that has similarities with List, but with two key
differences. Firstly, it is an unordered data type, and secondly, it does not allow
duplicate elements. Below are some examples of how to use it.

x = set({1, 2, 3}) # construct a set
print(f"x: {x}, type of x: {type(x)}")

x: {1, 2, 3}, type of x: <class 'set'>

The f in front of the string in the print command tells Python to evaluate the
expressions inside {·} and insert their values into the string.

x.add(1) # add an existing item
print(x)

{1, 2, 3}

x.add(4) # add a new item
print(x)

{1, 2, 3, 4}

# checking if a particular element exists in the set
if 1 in x:
    print(True)
if 5 in x:
    print(True)
else:
    print(False)

True
False

# Retrieving all the elements through a "for" loop
for i in x:
    print(i)

1
2
3
4

A.2.2 Package


Let us present five packages that are essential for writing code for the problems
dealt with in this book: (i) math; (ii) random; (iii) itertools; (iv) numpy; and (v) scipy.

(i) math The math module offers a range of useful mathematical functions,
such as exponential, logarithmic, square root, and power functions. Below are some
examples to illustrate their usage.

import math
math.exp(1) # exp(x)

2.718281828459045

print(math.log(1, 10)) # log(x, base)
print(math.log(math.exp(20))) # natural logarithm
print(math.log2(4)) # base-2 logarithm
print(math.log10(1000)) # base-10 logarithm

0.0
20.0
2.0
3.0

print(math.sqrt(16)) # square root
print(math.pow(2,4)) # x raised to y (same as x**y)
print(2**4)

4.0
16.0
16

print(math.cos(math.pi)) # cosine of x radians
print(math.dist([1,2],[3,4])) # Euclidean distance

-1.0
2.8284271247461903

# The erf() function can be used to compute traditional
# statistical functions such as the CDF of
# the standard Gaussian distribution
def phi(x):
    # CDF of the standard Gaussian distribution
    return (1.0 + math.erf(x/math.sqrt(2.0)))/2.0

phi(1)

0.8413447460685428

(ii) random This module provides random number generation. See below for some
examples.

import random

random.randrange(start=1, stop=10, step=1)
# a random number in range(start, stop, step)
random.randrange(10) # integer from 0 to 9 inclusive

5

# returns random integer n such that a<=n<=b
random.randint(1, 10)

7


(iii) itertools This package offers a concise method to explore all possible cases
in various combinatorial situations.

from itertools import permutations, combinations

# generating all permutations of [1, 2, 3]
p = permutations([1, 2, 3])

for i in p:
    print(i)

(1, 2, 3)
(1, 3, 2)
(2, 1, 3)
(2, 3, 1)
(3, 1, 2)
(3, 2, 1)

# generating all length-2 combinations of [1, 2, 3]
c = combinations([1, 2, 3], 2)

for i in c:
    print(i)

(1, 2)
(1, 3)
(2, 3)

# generating all length-3 combinations of [1, 2, 3, 4, 5]
c = combinations([1, 2, 3, 4, 5], 3)

for i in c:
    print(i)

(1, 2, 3)
(1, 2, 4)
(1, 2, 5)
(1, 3, 4)
(1, 3, 5)
(1, 4, 5)
(2, 3, 4)
(2, 3, 5)
(2, 4, 5)
(3, 4, 5)

(iv) numpy Numpy is a widely used package for manipulating matrices and vectors.
It provides numerous helpful functions, some of which are commonly utilized
and listed below.

(a) numpy.array() numpy.array() constructs a specialized array data structure in numpy.
This differs from the array type in Python's standard library.

import numpy as np
np.array([1, 2, 3]) # construct an array

array([1, 2, 3])

np.array([[1, 2], [3, 4]]) # construct a 2D array

array([[1, 2],
       [3, 4]])
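To make the difference from the built-in containers concrete, the short sketch below contrasts a numpy array with the List of Section A.2.1 (rather than with the standard-library array type). The variable names are chosen only for this illustration.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

doubled_list = values + values  # "+" on lists concatenates them
doubled_arr = arr + arr         # "+" on numpy arrays adds elementwise
scaled_arr = 2.5 * arr          # multiplying by a scalar is also elementwise

print(doubled_list)
print(doubled_arr)
print(scaled_arr)
```

Elementwise arithmetic of this kind is what makes numpy arrays convenient for the matrix and vector manipulations used throughout this book.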


x = np.ones((2,2))
# construct an all-one matrix with size of 2-by-2
x = np.zeros((2,2))
# construct an all-zero matrix with size of 2-by-2
print(np.ones_like(x))
# all-one matrix with the same shape and type of input
print(np.zeros_like(x))
# all-zero matrix with the same shape and type of input

[[1. 1.]
 [1. 1.]]
[[0. 0.]
 [0. 0.]]

# range of x
x_grid=np.arange(0,1,0.0001)
# or one can use linspace, whose third argument is the number of points:
x_grid2=np.linspace(0,1,10000,endpoint=False)

# concatenation of two numpy arrays
x1 = np.array([1,2])
x2 = np.array([3,4])
xc = np.concatenate((x1,x2)) # column-wise
xr = np.vstack((x1,x2)) # row-wise
print(xc)
print(xr)

[1 2 3 4]
[[1 2]
 [3 4]]

# sign function
x = np.array([1.2,-3,2,-4.2])
s = np.sign(x)
print(s)

[ 1. -1.  1. -1.]


(b) numpy.random() The purpose of this module is to generate random sam- 
ples from different probability distributions. We provide some commonly used 
examples below, but for more information, you may want to refer to: 


https://numpy.org/doc/1.16/reference/routines.random.html 


# sampling a number from standard Gaussian distribution
np.random.normal(loc = 0, scale = 1)  # loc: mean, scale: standard deviation

np.random.randn() # plays the same role


-2.5459976698222495 


# sampling multiple numbers as per the standard Gaussian
np.random.normal(0, 1, size = (2, 2))  # here the size determines the output shape
np.random.randn(2,2) # plays the same role


array([[-1.8133258 , -1.01151295],
       [-0.37375747,  0.36005748]])


np.random.rand(2,2) # Uniform over [0,1)


array([[0.06535694, 0.2507505 ],
       [0.17559137, 0.60967901]])


# Uniform over [0.8,1] 
np.random.uniform(0.8,1,(2,2)) 


array([[0.89902277, 0.85310313],
       [0.96578371, 0.85695091]])


# Binomial distribution
np.random.binomial(10000, 0.5) # 10000 trials of Bern(0.5)


5042 
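The sampled value 5042 is close to the mean n*p = 5000 of the binomial distribution. To make such experiments reproducible, one option is numpy's Generator interface with a fixed seed. The sketch below is ours: the seed and the number of repetitions are arbitrary choices, not values from the text.

```python
import numpy as np

# a seeded generator makes the random draws reproducible (seed chosen arbitrarily)
rng = np.random.default_rng(seed=0)

# repeat the experiment "10000 trials of Bern(0.5)" 1000 times
samples = rng.binomial(n=10000, p=0.5, size=1000)

# empirical mean should be near n*p = 5000
print(samples.mean())
# empirical standard deviation should be near sqrt(n*p*(1-p)) = 50
print(samples.std())
```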


(c) numpy.linalg Here are some of the useful linear-algebra related functions 
offered by this package. 


from numpy import linalg 


x = np.random.randn(2,2) 

print(linalg.det(x)) # Determinant of a matrix x
print(linalg.inv(x)) # Inverse of a matrix x
print(linalg.norm(x)) # Matrix or vector norm
print(linalg.svd(x)) # Singular value decomposition
print(linalg.eig(x)) # Eigenvalue decomposition


0.7125655927348966
[[ 0.77007826 -0.38835738]
 [ 2.33455331  0.64504946]]
1.832010151997132
(array([[-0.2060815 ,  0.97853483],
       [ 0.97853483,  0.2060815 ]]),
 array([1.78814528, 0.39849424]),
 array([[-0.96330981,  0.2683919 ],
       [ 0.2683919 ,  0.96330981]]))
(array([0.50418566+0.67702467j, 0.50418566-0.67702467j]),
 array([[0.02479485-0.37684352j, 0.02479485+0.37684352j],
       [0.92594502+0.j        , 0.92594502-0.j        ]]))
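Because the matrix above is random, its printed values change from run to run. A way to check these routines that does not depend on the draw is to verify identities they should satisfy. The sketch below uses a fixed 2-by-2 matrix of our own choosing (with determinant 2*3 - 1*1 = 5).

```python
import numpy as np
from numpy import linalg

# fixed matrix (our choice) so that the checks are reproducible
x = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# SVD: x = U diag(s) Vh
U, s, Vh = linalg.svd(x)
x_rec = U @ np.diag(s) @ Vh
print(np.allclose(x, x_rec))

# the inverse satisfies x x^{-1} = I
print(np.allclose(x @ linalg.inv(x), np.eye(2)))

# the default matrix norm is the Frobenius norm, i.e. sqrt(sum of squared singular values)
print(np.isclose(linalg.norm(x), np.sqrt(np.sum(s**2))))

# determinant of this particular matrix
print(np.isclose(linalg.det(x), 5.0))
```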


(d) numpy.fft One of the useful operations in communication and signal
processing is the Discrete Fourier Transform (DFT). Let $x[m]$ be a time-domain
discrete signal where $m \in \{0, 1, \ldots, N-1\}$. Then, the corresponding
frequency-domain signal reads:

$$X[k] = \frac{1}{\sqrt{N}} \sum_{m=0}^{N-1} x[m] e^{-j 2\pi mk/N}, \quad k \in \{0, 1, \ldots, N-1\}.$$

One can implement this using the built-in function fft in numpy.fft:

$$\text{fft}(x) = \sum_{m=0}^{N-1} x[m] e^{-j 2\pi mk/N}.$$


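To align with the DFT specified above, it is necessary to divide the output of fft by $\sqrt{N}$.
Conversely, the function ifft computes the inverse transform and can be used
in a similar manner:

$$\text{ifft}(X) = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{j 2\pi mk/N}.$$

Accordingly, the output of ifft is multiplied by $\sqrt{N}$ to recover $x[m]$.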


from numpy.fft import fft 
from numpy.fft import ifft 


x_time = np.random.randn(8)
X_freq = fft(x_time)/np.sqrt(8)
x_time_rec = np.sqrt(8)*ifft(X_freq)
print(x_time)

print(x_time_rec)


[ 0.19398987 0.92755053 1.14652418 1.05737049 -0.66500356
0.43650243 1.04576987 -0.95167376]

[ 0.19398987+0.j 0.92755053+0.j 1.14652418+0.j 1.05737049+0.j
-0.66500356+0.j 0.43650243+0.j 1.04576987+0.j -0.95167376+0.j]
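To see that the scaling by the square root of N indeed matches the DFT definition, one can build the DFT matrix directly and compare it against fft. The sketch below does so; the matrix F and the seeded generator are our own constructions for illustration. With this normalization the transform is unitary, so the signal energy is preserved (Parseval's relation).

```python
import numpy as np
from numpy.fft import fft

N = 8
rng = np.random.default_rng(0)   # seed chosen arbitrarily
x_time = rng.standard_normal(N)

# DFT matrix with entries exp(-j 2 pi m k / N) / sqrt(N); rows indexed by k, columns by m
m = np.arange(N)
k = m.reshape(-1, 1)
F = np.exp(-2j * np.pi * k * m / N) / np.sqrt(N)

X_direct = F @ x_time             # DFT computed from the definition
X_fft = fft(x_time) / np.sqrt(N)  # DFT computed via fft with the 1/sqrt(N) scaling

print(np.allclose(X_direct, X_fft))

# energy in time domain equals energy in frequency domain
print(np.isclose(np.sum(np.abs(x_time)**2), np.sum(np.abs(X_fft)**2)))
```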


(e) resizing Resizing transforms an array of one shape into
another.

x = np.random.randn(4,4,1)

y = x.view(dtype=np.float64).reshape(-1,2)
# '-1' is inferred from the remaining dimensions: shape of (8,2)

print(y)

z = x.squeeze()  # removes axes of length 1

print(z.shape)


[[-0.85719316 2.99692221] 
[ 1.16327996 -0.11955541] 
[-0.76229609 0.79871494] 
[ 0.99757568 0.69329723] 
[-1.52198295 -0.74430996] 
[ 0.17174063 0.25343301] 
[ 0.07151011 -2.90945412] 
[ 1.1874155 -0.64209109]] 
(4, 4)

(v) scipy The scipy.stats module offers an extensive collection of probability
distributions and corresponding statistical metrics. Presented below are a few
examples; for additional details, please refer to:


https://docs.scipy.org/doc/scipy/reference/stats.html 


from scipy import stats 


# A random variable with the standard Gaussian distribution
X = stats.norm(loc = 0, scale = 1)  # loc: mean, scale: standard deviation
print(X.cdf(np.array([-1, 0, 1])))  # computes the CDF at each entry of the numpy array
print(X.rvs(size = 3))              # generates a sequence of random samples


[0.15865525 0.5 0.84134475] 
[ 0.39460402 -0.8042592 -0.71404882] 


# Another random variable with the uniform distribution
Y = stats.uniform(loc = 0, scale = 1)  # uniform distribution in [loc, loc + scale]
print(Y.cdf(np.array([-1, 0, 0.5, 1])))

print(Y.rvs(size = 3))


[0.  0.  0.5 1. ]
[0.72953474 0.67879248 0.47947748] 

For binary random variables, we employ a built-in function, bernoulli in 
scipy.stats. 


from scipy.stats import bernoulli 
X = bernoulli(0.5)

X_samples = X.rvs(10)
print(X_samples)


[0 1 0 0 1 0 0 1 0 0]
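Objects such as stats.norm(...) and bernoulli(...) are "frozen" distributions: besides cdf and rvs, they expose further methods such as ppf (the inverse of the CDF), pmf, mean and var. The following sketch illustrates a few of them; the parameter 0.3 and the seed passed via random_state are our own choices.

```python
import numpy as np
from scipy import stats

# ppf is the inverse of cdf for a continuous distribution
X = stats.norm(loc=0, scale=1)
p = X.cdf(1.0)
print(np.isclose(X.ppf(p), 1.0))

# a Bernoulli random variable with success probability 0.3 (our choice)
B = stats.bernoulli(0.3)
print(B.pmf([0, 1]))       # probabilities of the outcomes 0 and 1
print(B.mean(), B.var())   # p and p*(1-p)

# the empirical mean of many samples should be near p
samples = B.rvs(size=10000, random_state=0)
print(samples.mean())
```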


It also contains built-in functions for entropy and the KL divergence. 


from scipy.stats import entropy 


pX1 = np.array([1/2, 1/2]) # numpy.array
pX2 = [1/2, 1/2] # list
print(entropy(pX1, base=2))
print(entropy(pX2, base=2))


1.0
1.0


Here the input distribution can take either a numpy.array or a list. 


from scipy.special import rel_entr 


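# Compute p(x,y), ordered as (x,y) = (0,0), (0,1), (1,0), (1,1)
pXY = np.array([1/4, 1/4, 1/3, 1/6])

# Marginals p(x) and p(y) implied by p(x,y)
pX = np.array([1/4 + 1/4, 1/3 + 1/6])
pY = np.array([1/4 + 1/3, 1/4 + 1/6])

# Compute p(x)p(y)
pXpY = np.array([pX[0]*pY[0], pX[0]*pY[1],
                 pX[1]*pY[0], pX[1]*pY[1]])

kl_builtin = rel_entr(pXY, pXpY)

print(sum(kl_builtin))

# To convert into log base 2
print(sum(kl_builtin)/np.log(2))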

# To convert into log base 2 

print(sum(kl_builtin)/np.log(2)) 


0.014362591564146779 
0.020720839623908218 


To calculate the KL divergence, scipy.special provides the function rel_entr.
rel_entr uses natural logarithms instead of log base 2 and produces a list of values
in the form of $p(x) \ln \frac{p(x)}{q(x)}$. Therefore, proper conversion is required.
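The conversion can be cross-checked in two ways. First, scipy.stats.entropy computes the KL divergence directly when a second distribution is passed, and accepts base=2. Second, the KL divergence between a joint distribution and the product of its marginals is the mutual information, which also equals H(X) + H(Y) - H(X,Y). The sketch below compares the three computations for the joint distribution used above; arranging it as a 2-by-2 matrix is our own choice.

```python
import numpy as np
from scipy.stats import entropy
from scipy.special import rel_entr

# joint distribution p(x,y): rows index x, columns index y
pXY = np.array([[1/4, 1/4],
                [1/3, 1/6]])
pX = pXY.sum(axis=1)      # marginal of X
pY = pXY.sum(axis=0)      # marginal of Y
pXpY = np.outer(pX, pY)   # product distribution p(x)p(y)

# (1) rel_entr returns elementwise terms in nats; sum and convert to bits
kl_bits_a = np.sum(rel_entr(pXY, pXpY)) / np.log(2)

# (2) entropy(pk, qk) computes the KL divergence; base=2 gives bits
kl_bits_b = entropy(pXY.ravel(), pXpY.ravel(), base=2)

# (3) mutual information via entropies
mi_bits = (entropy(pX, base=2) + entropy(pY, base=2)
           - entropy(pXY.ravel(), base=2))

print(kl_bits_a, kl_bits_b, mi_bits)
```

All three values should agree with the base-2 figure 0.0207 printed above.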
In communication problems, it is common to compute the Q-function, which
is defined as:

$$Q(a) := \int_{a}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \, dz.$$
The analysis of communication error probability is aided by this function. The
required integral can be computed numerically with the erfc function provided by
scipy.special:

$$\text{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-t^2} \, dt.$$

The relation between $Q(a)$ and erfc is:

$$Q(a) = \int_{a}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \, dz \overset{(a)}{=} \frac{1}{\sqrt{\pi}} \int_{\frac{a}{\sqrt{2}}}^{\infty} e^{-t^2} \, dt = \frac{1}{2} \text{erfc}\left(\frac{a}{\sqrt{2}}\right)$$

where (a) comes from the change of variable $t := \frac{z}{\sqrt{2}}$ ($dz = \sqrt{2}\,dt$).

Figure A.7. Plotting a simple function via matplotlib.pyplot.
from scipy.special import erfc
a = 10


Qfunc = 1/2*erfc(a/np.sqrt(2))  # Q(a) = (1/2) erfc(a/sqrt(2))
print(Qfunc)  # approximately 7.62e-24 for a = 10
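The identity Q(a) = (1/2) erfc(a/sqrt(2)) can be checked against an independent route: Q(a) is the tail probability of a standard Gaussian, which scipy.stats exposes as the survival function norm.sf. The sketch below compares the two; the helper name q_function and the test points are ours.

```python
import numpy as np
from scipy.special import erfc
from scipy.stats import norm

def q_function(a):
    """Q(a) computed via the complementary error function."""
    return 0.5 * erfc(a / np.sqrt(2))

a = np.array([0.0, 1.0, 3.0, 10.0])

print(q_function(a))   # via erfc
print(norm.sf(a))      # via the Gaussian survival function 1 - CDF
```

Note that the argument of erfc is a divided by the square root of 2, independent of the value of a.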


A.2.3 Visualization 


matplotlib.pyplot is the most commonly used module for graph plotting. The
following is a guide on how to utilize it.

import matplotlib.pyplot as plt


x_value = [x for x in range(10)]
y_value = [y for y in range(10, 20)]


plt.figure(figsize=(4,4), dpi=150) # figure size and resolution
plt.plot(x_value, y_value, color='blue', label='line')


plt.xlabel('x') # labeling x-axis
plt.ylabel('y') # labeling y-axis
plt.title('sample curve')

plt.legend()


plt.show() # No need to call show() in a Jupyter notebook.


Figure A.8. Multiple functions and legend. 


It is also possible to plot multiple curves in a single graph. 


# we can plot multiple graphs at once


x = [x for x in range(10)]
y_1 = [3*y for y in range(10)]
y_2 = [2*y for y in range(10)]


plt.figure(figsize=(4,4), dpi=150)

plt.plot(x, y_1, color='blue', label='y=3x') # plot 1
plt.plot(x, y_2, color='red', label='y=2x') # plot 2
plt.xlabel('x') # labeling x-axis

plt.ylabel('y') # labeling y-axis

plt.title('two sample curves')

plt.legend()

plt.show()


To draw the probability distribution of a random variable, we often employ a
statistical visualization package named seaborn.

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import bernoulli


X = bernoulli(0.5)
X_samples = X.rvs(1000)


plt.figure(figsize=(4,4), dpi=150)
sns.histplot(X_samples)

plt.xlabel('Values of a random variable')
plt.ylabel('Histogram')

plt.show()


In communication problems, it is common to plot the probability of error,
which is often exceedingly small, such as $10^{-5}$. To better differentiate between
small probability values, a logarithmic scale of error probability is utilized. This
can be achieved by employing the function plt.yscale('log').


import numpy as np 
from scipy.special import erfc 
import matplotlib.pyplot as plt 


SNRdB = np.arange(0,21,1)
SNR = 10**(SNRdB/10)


# Q-function
Qfunc = 1/2*erfc(np.sqrt(SNR/2))


plt.figure(figsize=(4,4), dpi=150)
plt.plot(SNRdB, Qfunc, label='Q(sqrt(SNR))')
plt.yscale('log')

plt.xlabel('SNR (dB)')

plt.grid(linestyle=':', linewidth=0.5)
plt.title('Q function')

plt.legend()

plt.show()

Figure A.9. Plotting a histogram of independent realizations of a random variable.

Figure A.10. Logarithmic scale of the Q-function as a function of SNR.

Figure A.11. How to adjust the font size of axis tick values (left panel: tick size = 10; right panel: tick size = 20).


To modify the font size of axis tick values, the matplotlib.rc command is utilized. 
The following code serves as an example: 


# To adjust the font size of axis tick values
import matplotlib
import matplotlib.pyplot as plt
import numpy as np


x = np.linspace(0,1,100)
y = np.sqrt(x)

plt.figure(figsize=(10,5), dpi=200)
matplotlib.rc('xtick', labelsize=10)
matplotlib.rc('ytick', labelsize=10)
plt.subplot(1,2,1)
plt.plot(x,y)
plt.title('tick size = 10')

matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)
plt.subplot(1,2,2)
plt.plot(x,y)
plt.title('tick size = 20')
plt.show()


DOI: 10.1561/9781638281153.ch5


Appendix B 


TensorFlow and Keras Basics 


Outline Part III covered several applications in data science, such as machine
learning and deep learning. Deep learning, a learning approach that utilizes a deep
neural network (DNN) as a basic model for predictions, can be implemented using
various software tools known as machine learning frameworks or application
programming interfaces (APIs). TensorFlow, Keras, PyTorch, DL4J, Caffe, and MXNet
are some examples of such frameworks. Each framework has its advantages and
disadvantages, depending on the requirements of the deep learning model design,
such as usability, training speed, functionality, and scalability in distributed
training. This book prioritizes usability and therefore focuses on Keras, a high-level
API that enables fast user experimentation.

The Keras API facilitates moving from idea to implementation with minimal
steps, making it an ideal choice for this book. This appendix covers four basic
topics related to Keras. Since Keras is fully integrated with TensorFlow, it comes
packaged with the TensorFlow installation. In the first part, we will learn how to
install TensorFlow. To implement deep learning, three key procedures are required:
(i) data preparation and processing; (ii) neural network model building; and (iii)
model training and testing. The second part will cover an easy way to handle data
using Keras. The third part builds a neural network model using popular packages
such as keras.models and keras.layers. Finally, we will explore how to train and
test a model. We illustrate these procedures with a simple example.

Installation Installing Keras requires the installation of TensorFlow. Fortunately,
the installation process is straightforward:


pip install tensorflow 


Keras is fully supported by TensorFlow 2 packages. To ensure a proper
installation, a pip version higher than 19.0 (or higher than 20.3 for macOS)
is required. You may need to upgrade pip by running the command:
"pip install --upgrade pip". To confirm a successful installation, try importing
keras using the following command:


from tensorflow import keras 


If there are no errors, then you are ready to start using Keras. However, if you do 
encounter any errors, you may want to refer to the installation guidelines found at: 


https://www.tensorflow.org/install 


A simple task We will be focusing on a simple task of classifying handwritten 
digits, where the objective is to identify a digit from an image of handwritten digits. 
An example of such an image is shown in Fig. B.1. The figure demonstrates a case 
where an image of the digit 2 is accurately identified. 


Preparing and processing data The digit classification task is commonly associated
with the MNIST (Modified National Institute of Standards and Technology)
dataset, which contains 60,000 training images and 10,000 testing images. Each
image, denoted by $x^{(i)}$, is a 28 × 28 pixel image with gray-scale levels ranging from
0 (white) to 1 (black). Additionally, each image has a label, denoted by $y^{(i)}$, that
corresponds to one of the 10 classes, $y^{(i)} \in \{0, 1, \ldots, 9\}$. Please refer to Fig. B.2 for
an illustration of the dataset.

Figure B.1. Handwritten digit classification.

Figure B.2. MNIST dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ with $m = 60{,}000$: An input image is of 28-by-28 pixels, each indicating an
intensity from 0 (white) to 1 (black); each label with size 1 takes one of the 10 classes from
0 to 9.

Keras offers the advantage of having popular datasets, such as MNIST, readily
available in a sub-package called keras.datasets. This sub-package includes both
the train and test datasets, which are already properly split. Hence, there is no need
to worry about the splitting process. The only requirement is to write a script as
follows:

from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train/255.

X_test = X_test/255.


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz

11493376/11490434 [=======================] - 1s 0us/step
11501568/11490434 [=======================] - 1s 0us/step


Normalization is an essential data preprocessing step. To this end, we divide
the input data (X_train or X_test) by its maximum value of 255. If the dataset
we want to use is not available in the keras.datasets sub-package, we need to be 
familiar with other data preprocessing techniques. The pandas library offers one 
such technique that is useful in handling .csv files. However, this book does not 
cover the usage of pandas in detail. If you want to learn more about pandas, you 
can refer to: 


https://pandas.pydata.org/ 


We use matplotlib.pyplot for data visualization. The following code shows how
to plot a sample image:

import matplotlib.pyplot as plt


plt.imshow(X_train[0], cmap='gray_r')
plt.colorbar()
plt.title('{}'.format(y_train[0]), fontsize=30)

Figure B.3. A sample image in MNIST dataset.


The output of the code for plotting the sample image is shown in Fig. B.3. The 
‘gray_r’ option is used to enable the white background and a black letter, while 
‘gray’ is used for the flipped one, which is a white letter with a black background. 
The colorbar() function displays the color bar on the right, as seen in Fig. B.3. It is 
also possible to plot multiple images in a single figure. For instance, the following 
code shows how to display 60 images. 
num_of_images = 60
for index in range(1, num_of_images+1):
    plt.subplot(6, 10, index)
    plt.axis('off')
    plt.imshow(X_train[index], cmap='gray_r')


See Fig. B.4 for the output. 


Building a neural network model We will use a two-layer neural network 
that was studied in Section 3.14. Specifically, we introduce a hidden layer with 500 
neurons as shown in Fig. B.5. The ReLU activation function is used at the hidden 
layer and softmax is used at the output layer. 

Keras includes two major packages:


(i) tensorflow.keras.models;


(ii) tensorflow.keras.layers.


The models package contains several functionalities regarding a neural network.
One major module is Sequential, which is a neural network entity that can
be described as a linear stack of layers. The layers package in Keras includes various
elements required for constructing a neural network, such as fully-connected dense
layers and activation functions. These components enable us to easily build a model
as depicted in Fig. B.5.

Figure B.4. Plotting many image samples in a single figure.

Figure B.5. A two-layer fully-connected neural network where input size is 28 × 28 = 784,
the number of hidden neurons is 500 and the number of classes is 10. We employ ReLU
activation at the hidden layer, and softmax activation at the output layer.


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten


model = Sequential()
model.add(Flatten(input_shape=(28,28)))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()


Model: "sequential_1"

Layer (type)          Output Shape     Param #
flatten (Flatten)     (None, 784)      0
dense (Dense)         (None, 500)      392500
dense_1 (Dense)       (None, 10)       5010

Total params: 397,510
Trainable params: 397,510
Non-trainable params: 0


Flatten is a component that transforms a higher dimensional entity, such as a 
2D matrix, into a vector. In this instance, a 28-by-28 digit image is transformed 
into a vector of size 784(= 28 x 28). The add() method is used to append a layer 
to the end of the sequential model. Dense denotes a fully-connected layer, and the 
input size is automatically determined by the last layer to which it will be appended. 
The only parameter to specify is the number of output neurons, which is set to 500 
in this example, corresponding to the number of hidden neurons. We can also set 
an activation function, such as activation=’relu’, with an additional argument. The 
output layer has 10 neurons, which corresponds to the number of classes, and uses 
softmax activation to represent the likelihood of an output belonging to a particular 
class. The summary() function generates a list of all layers, specifying the output size and
number of associated parameters.
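The parameter counts reported by summary() can be reproduced by hand: a fully-connected layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases, while Flatten has no parameters. The sketch below redoes this arithmetic in plain Python; the helper dense_params is hypothetical and not part of Keras.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully-connected layer: weights plus biases."""
    return n_in * n_out + n_out

n_input = 28 * 28                      # size of the flattened image
hidden = dense_params(n_input, 500)    # hidden layer with 500 neurons
output = dense_params(500, 10)         # output layer with 10 classes

print(n_input, hidden, output, hidden + output)
```

The three counts match the 392500, 5010 and 397,510 entries of the summary.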


Training a model First, we have to choose an optimizer algorithm. One popular
algorithm is gradient descent, and we will be using its advanced version introduced
in Section 3.14 called the Adam optimizer. Adam is an improved version of gradient
descent that provides more stable training. It has three important hyperparameters,
namely the learning rate $\alpha$, $\beta_1$ (which represents the weight of past gradients),
and $\beta_2$ (which indicates the weight of the square of past gradients). By default,
these are set to $(\alpha, \beta_1, \beta_2) = (0.001, 0.9, 0.999)$. If we do not specify any values,
these default values will be used.
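To make the roles of the three hyperparameters concrete, the following is a minimal numpy sketch of the commonly stated Adam update with bias correction, applied to a toy quadratic objective. It is an illustration under our own assumptions rather than the Keras implementation: the function name adam_minimize, the small constant eps, the toy objective, and the step count and learning rate in the usage line are all ours.

```python
import numpy as np

def adam_minimize(grad, w0, steps=2000, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function given its gradient, using the standard Adam update."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # running average of past gradients (weighted by beta1)
    v = np.zeros_like(w)   # running average of squared past gradients (weighted by beta2)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# toy objective ||w - target||^2 with gradient 2 (w - target)
target = np.array([1.0, -2.0])
w_final = adam_minimize(lambda w: 2 * (w - target), w0=[0.0, 0.0], steps=5000, lr=0.01)
print(w_final)
```

The iterate should end up close to the target. A larger learning rate speeds up the initial progress at the cost of larger steps near the optimum.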

Next, we need to specify a loss function. As we learned in Section 3.13, the 
optimal choice for maximizing likelihood in multi-class cases is cross entropy. We 
also need to specify a performance metric that we will use to evaluate the model 
during training and testing. The accuracy metric is commonly used for this purpose. 
All of these can be set using the compile method. 

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])

To manually choose the hyperparameters of the Adam optimizer, we can define:


import tensorflow

opt = tensorflow.keras.optimizers.Adam(
    learning_rate=0.01,
    beta_1=0.92,
    beta_2=0.992)


Next, we replace the previous option with optimizer=opt. As for the loss option
in the compile method, we use 'sparse_categorical_crossentropy', the cross entropy
loss for classification beyond the binary case in which each label is given as an
integer class index (as is the case for y_train).
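For intuition about what this loss measures, the sketch below computes the cross entropy for integer labels by hand in numpy: for each example it picks the predicted probability of the true class and averages the negative natural logarithms over the batch. This is our own illustration with made-up probabilities, not the Keras implementation, and the function name merely mirrors the Keras option string.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Average of -ln(predicted probability of the true class) over the batch."""
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(y_true)), y_true]   # probability assigned to each true label
    return -np.mean(np.log(picked))

# two examples, three classes (probabilities made up for illustration)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])            # integer class indices, not one-hot vectors
loss = sparse_categorical_crossentropy(y_true, probs)
print(loss)

# a uniform prediction over 10 classes gives a loss of ln(10)
uniform = np.full((1, 10), 0.1)
loss_uniform = sparse_categorical_crossentropy(np.array([3]), uniform)
print(loss_uniform)
```

The uniform case is a useful reference point: an untrained 10-class classifier typically starts with a loss near ln(10).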

With these settings, we can now train the model on MNIST data. During the 
training process, we use a portion of the total examples to compute the gradient of 
the loss function, which is called a batch. Two more terms are used in this context: 
a step refers to the process of computing the loss for the examples in a single batch, 
while an epoch refers to the entire process associated with all the examples. For our 
experiment, we use a batch size of 64 and train the model for 20 epochs. 


history = model.fit(X_train, y_train, batch_size=64, epochs=20) 


Epoch 1/20
938/938 [===] - 2s 2ms/step - loss: 0.0025 - acc: 0.9992
Epoch 2/20
938/938 [===] - 2s 2ms/step - loss: 0.0059 - acc: 0.9981
Epoch 3/20
938/938 [===] - 2s 2ms/step - loss: 0.0031 - acc: 0.9990
Epoch 4/20
938/938 [===] - 2s 2ms/step - loss: 0.0074 - acc: 0.9976
Epoch 5/20
938/938 [===] - 2s 2ms/step - loss: 0.0025 - acc: 0.9993
Epoch 6/20
938/938 [===] - 2s 2ms/step - loss: 0.0043 - acc: 0.9984
Epoch 7/20
938/938 [===] - 2s 2ms/step - loss: 0.0044 - acc: 0.9984
Epoch 8/20
938/938 [===] - 2s 2ms/step - loss: 0.0010 - acc: 0.9998
Epoch 9/20
938/938 [===] - 2s 2ms/step - loss: 1.2813e-04 - acc: 1.0
Epoch 10/20
938/938 [===] - 2s 2ms/step - loss: 3.5169e-05 - acc: 1.0
Epoch 11/20
938/938 [===] - 2s 2ms/step - loss: 2.1899e-05 - acc: 1.0
Epoch 12/20
938/938 [===] - 2s 2ms/step - loss: 1.6756e-05 - acc: 1.0
Epoch 13/20
938/938 [===] - 2s 2ms/step - loss: 1.2778e-05 - acc: 1.0
Epoch 14/20
938/938 [===] - 2s 2ms/step - loss: 9.8947e-06 - acc: 1.0
Epoch 15/20
938/938 [===] - 2s 2ms/step - loss: 0.0082 - acc: 0.9981
Epoch 16/20
938/938 [===] - 2s 2ms/step - loss: 0.0090 - acc: 0.9971
Epoch 17/20
938/938 [===] - 2s 2ms/step - loss: 0.0016 - acc: 0.9995
Epoch 18/20
938/938 [===] - 2s 2ms/step - loss: 3.9583e-04 - acc: 0.9999
Epoch 19/20
938/938 [===] - 2s 2ms/step - loss: 7.6672e-05 - acc: 1.0
Epoch 20/20
938/938 [===] - 2s 2ms/step - loss: 2.4958e-05 - acc: 1.0

Figure B.6. Accuracy as a function of epochs.
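The counter 938/938 in the training log is the number of steps per epoch. It follows from the dataset size and the batch size: 60,000 examples split into batches of 64 give 937 full batches plus one final, smaller batch. The short sketch below redoes this arithmetic; the variable names are ours.

```python
import math

num_examples = 60000   # MNIST training images
batch_size = 64
epochs = 20

# one step per batch; the last batch may be smaller than batch_size
steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch = num_examples - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_steps)
```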


An advantage of using the fit() function is that it provides a dictionary of the 
metrics that were gathered during the training process. We can examine the metrics 
by running: 

# list all data in history object
print(history.history.keys())


dict_keys(['loss', 'acc'])


We can create a plot of the accuracy as a function of epochs using the collected 
data. 

plt.plot(history.history['acc'])
plt.title('model accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')

Testing the trained model To conduct testing, we need to predict the model
output using the predict() function in the following manner:


model.predict(X_test).argmax(1)
array([7, 2, 1, ..., 4, 5, 6], dtype=int64)


The function argmax(1) retrieves the class with the highest softmax output among 
the 10 available classes. In order to assess the accuracy of the test set, we utilize the 
evaluate() function: 


model.evaluate(X_test, y_test) 


313/313 [===] - 0s 751us/step - loss: 0.1001 - acc: 0.9847


[0.10007859766483307, 0.9847000241279602] 
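The second number returned by evaluate() is the accuracy, i.e., the fraction of test examples whose predicted class (the argmax of the softmax output) equals the true label. The sketch below mimics this computation in numpy on a tiny made-up batch; the probabilities and labels are purely illustrative and unrelated to MNIST.

```python
import numpy as np

# made-up softmax outputs for 4 examples and 3 classes (each row sums to 1)
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1]])
y_true = np.array([1, 0, 2, 1])

# argmax along axis 1 picks, for each example, the class with the highest probability
y_pred = probs.argmax(axis=1)

# accuracy is the fraction of correct predictions
accuracy = np.mean(y_pred == y_true)
print(y_pred, accuracy)
```

Here the last example is misclassified, so three out of four predictions are correct.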


Saving and loading Saving and loading the trained model is a straightforward 
process, as shown below. 


model.save('saved_classifier')


INFO:tensorflow:Assets written to: saved_classifier\assets


import tensorflow
loaded_model = tensorflow.keras.models.load_model(
    'saved_classifier')


DOI: 10.1561/9781638281153.ch6


Appendix C 


A Special Note on Research 


In this appendix, we aim to provide some insights that could be valuable for your 
career advancement, specifically in the realm of conducting research. We will cover 
two key topics related to research. Firstly, we will provide advice on what aspects to 
concentrate on while conducting research. Secondly, we will outline a methodology 
for reading research papers. 


C.1 Power of Fundamentals 


A piece of advice The main message we want to convey through this book is the importance
of fundamental concepts and tools such as phase transitions, entropy, mutual
information, and the KL divergence. One piece of advice we would like to offer is to 
focus on strengthening your understanding of these fundamentals, particularly as 
they relate to modern technologies. To further explain what we mean, let’s examine 
how fundamentals have evolved from the past to the present day. 


Fundamentals in old days Technologies in the past were shaped by the 1st,
2nd, and 3rd industrial revolutions, which were made possible by groundbreaking
inventions inspired by scientific discoveries. The steam engine was the key invention
of the 1st industrial revolution, based on the principles of thermodynamics
in which physics and chemistry played foundational roles. The 2nd revolution was
triggered by the invention of electricity, which is based on electromagnetism, again
with physics as the underlying theory. The 3rd revolution was brought about by the
computer, which is based on the invention of the semiconductor, with physics and
chemistry providing the foundation. Although these fundamentals are crucial, they
are not the ones we are referring to in this book. Our emphasis is on modern-day
technologies.


Fundamentals in modern days The 4th industrial revolution is currently 
driving modern day technologies. It is widely accepted that the main focus 
of this revolution is Artificial Intelligence (AI). It is important to note that 
machine learning and deep learning, which are key methodologies for achieving 
machine intelligence, rely on optimization techniques that fall under the umbrella 
of mathematics. Therefore, mathematics is a significant driving force for the 
development of AI. 


Four fundamentals in mathematics The 4th industrial revolution, centered 
around Artificial Intelligence (AI), relies heavily on mathematics. In particular, 
four branches of mathematics play foundational roles in AI: optimization, linear 
algebra, probability, and information theory. Optimization, a branch of mathemat- 
ics that deals with finding the optimal solution to a problem, is a key methodol- 
ogy in achieving machine intelligence. Linear algebra provides instrumental tools 
for obtaining simple and tractable formulas of the objective function and/or con- 
straints. Probability is used to deal with random quantities, and information theory 
sheds optimal architectural insights into machine learning models. These funda- 
mentals are essential in the 4th industrial revolution. Therefore, our advice is to 
be strong in these fundamentals. However, it is worth noting that it is easier to 
build these fundamentals when you are younger and in school, as you may not 
have enough time or stamina to develop a deep understanding of these principles 
after graduation. 


Programming skills Do fundamentals alone suffice? Unfortunately, the answer 
is negative. There is another vital skill that you must possess. Remember that 
the ultimate product of machine learning is “algorithms.” In other words, it is 
a set of instructions that typically require extensive computation. Manual
computation is virtually impossible, and therefore, it must be performed on a
computer. This is where programming tools, such as Python and TensorFlow, that we
have utilized throughout this book, become essential. We strongly suggest that you
become proficient in these tools as well. Swift implementations through exceptional
programming abilities will assist you in realizing and advancing your concepts.
One caveat to note is that programming tools change over time. This is due to
the rapid evolution of computational resources and capabilities, which influence
the efficiency of programming languages. As a result, you must stay up to date with
such changes.


C.2 How to Read Papers?


We would like to share with you our thoughts on a methodology for conducting 
research, specifically on how to approach reading papers. We believe this is a crucial 
skill for students, yet it is often not explicitly taught in university curriculums. 
Many students may approach reading papers without a clear strategy or purpose, 
which can be challenging even for experienced researchers. 

Disclaimer: The following recommendations are based on our own opinions. 


Two strategies We present two approaches: a passive strategy that is useful when 
you are trying to grasp the basics and trends of a field you are interested in (if you 
are not planning to write a paper), and an active strategy that is more suitable for 
those who want to become experts in a trending but not fully developed field, or 
for those who need to write papers on such topics. 


Passive approach The concept is to depend on the expertise of professionals. 
This means waiting until the experts can provide instruction in a very understand- 
able way, and then reading papers written by these professionals. Ideally, these 
papers should be of the survey type. Two questions that come to mind are: (i) how 
can you determine who the experts are?; and (ii) how can you read well-written 
papers effectively? 


How to figure out experts? We suggest three practices that can help you find 
experts in the field. The first practice is to approach professors, seniors, and peers 
who are already working in the field. In most cases, well-known figures in the 
field can be easily recognized, so you can simply ask them for recommendations. 
The second practice is to use Google search with appropriate keywords. Nowadays, 
many relevant blogs are available on websites. These blogs may provide references 
or pointers to experts in the field. Once you have identified an expert, you can check 
their track record, such as their Google citations. The third practice is to look for 
organizers and keynote speakers in flagship conferences and workshops. These indi- 
viduals are likely to be experts in the field. However, you may feel unsure when you 
identify scholars who are too young and have weak track records. In such cases, you 
may want to check their advisors, as great advisors usually produce great students. 


How to read well-written papers? After identifying the experts, the next step 
is to read their well-written papers. Survey papers that have been cited frequently 
are generally a good choice. Here are some guidelines on how to read them. 

First and foremost, it is crucial to read the papers intensively rather than simply
skimming through them. It is best to read the paper while taking notes and attempting
to translate the authors' words into your own understanding. It is recommended
to ponder the theorems before reading their proofs. The main messages and their
implications are usually conveyed through theorems, and comprehending them is
the most important aspect. Attempting to read technical proofs prior to grasping
what the theorems mean will be unproductive, and you may quickly lose interest
or become burnt out. Once you understand the main messages, you can delve into
the technical details.

Secondly, it is recommended to read the paper multiple times until you have 
completely absorbed its contents. A well-written paper is like a bible, and repeated 
readings can provide a deep and diverse understanding of its contents. It is also 
important to be familiar with any proof techniques used in the paper. A good paper 
typically contains simple and insightful proof techniques. A short proof may not 
necessarily be good unless it is insightful. We believe that a good proof comes with 
insights, even if it is lengthy. 

Finally, it is essential to respect the experts’ notations, logical flows, mindsets, and 
writing styles. Expert writers usually spend a lot of time selecting their notations, 
developing a storyline, and refining their writing. Their choices are well-considered 
and carefully crafted. You may have your own preferred style, but if you find their 
methods to be better, don’t hesitate to adopt them. If you don’t have a preference 
or are unable to judge, simply follow theirs. It is worth emulating their style. 


Active approach Let’s now discuss the second approach, which is an active 
approach designed for individuals who wish to write papers in a relatively new 
or emerging field. This approach is suitable for those who cannot wait for experts 
to become available to teach the field. The spirit of the approach is based on trial 
and error. It involves reading numerous recent papers and going back and forth 
until you can grasp the big picture. This approach is not easy and requires two 
key skills: (i) the ability to quickly identify relevant and high-quality papers; and 
(ii) the ability to quickly comprehend the main ideas presented in these papers. 


How to identify relevant/good papers quickly? We recommend two 
approaches. The first is similar to the one suggested in the passive approach, which 
is to ask experts in the field, such as professors, seniors, or peers. However, we 
emphasize the importance of seeking out experts who are familiar with the field, 
as identifying good papers in a new field can be challenging. Non-experts may not 
be able to identify relevant papers easily, and even if they can, they may not be 
motivated to invest a significant amount of time searching for papers on behalf of 
others. 

The second approach is to search for papers in relevant conferences and work- 
shops, but in a strategic manner. With so many conferences and workshops, it 
can be overwhelming to go through all the papers. To streamline the process, we 
suggest a two-step filtering method. First, look at the title and only consider papers 
that contain relevant keywords, are grammatically correct, and are well-written. If 
a paper fails to meet these criteria, remove it from the list. Second, read the abstract 
of the surviving papers. If the abstract is well-written and relevant, add the paper to 
a shortlist. Authors typically spend a significant amount of time crafting and refin- 
ing the abstract, so if it is poorly written, the main body is likely to be even worse, 
which can make it challenging to understand the paper’s main idea. Therefore, it is 
best to exclude such papers from the shortlist. 


Further short-list papers if needed After applying the aforementioned guide- 
lines, you might be left with a considerable number of papers. If the number of 
selected papers is approximately 10 or more, we advise you to refine the list by 
delving deeper into each paper. Here’s how to go about it: determine your objec- 
tive and what you hope to gain from reading each paper, then begin by reading 
the “introduction” section with your objective in mind. If the paper’s introduction 
is well-written and relevant to your goal, it should remain on your list. If not, it 
should be removed from the list. 


Figure out main ideas of the short-listed papers At this point, the list 
should only contain a few papers. If it still contains too many, repeat the previous 
step with stricter criteria until you have only a few papers left. Once you’ve narrowed 
down the list, search for the main idea sentence in the introduction of each paper. 
If you can’t find it or don’t understand it, read other sections of the paper until you 
do. Once you understand it, rephrase the main idea sentence in your own words 
and write it down in the heading on the first page of the paper. If you get frustrated 
during the process, it’s okay to stop reading and move on to the next paper. 


Do back-and-forth Continue the iterative process of summarizing the main 
ideas of the short-listed papers. Two important considerations to keep in mind 
are: first, do not invest too much time reading “related works.” These sections are 
often dense and not critical to the main story of the paper. They exist mainly to 
avoid criticism from non-cited authors. You may choose to ignore them entirely. 
Second, when investigating other references, read them quickly and stay focused on 
your goal. It’s best not to get sidetracked by reading other references or spending 
too much time on them. 

You can end the iterative process when you either: (a) gain a complete under- 
standing of the topic, or (b) find a well-written “anchor paper” that presents a clear 
overview of the subject. If you find an anchor paper, follow the passive approach to 
read it thoroughly. 


Do your own research After figuring out the big picture, it’s best to concentrate 
on your own research and avoid getting sidetracked by reading more papers. Follow 
these steps to get started: 


1. Select a challenge that you want to tackle; 

2. Define a specific and tangible problem that can address the challenge; 

3. Attempt to solve the problem using conventional wisdom and first-principle 
thinking. 


You may well get stuck during the third step. 
In that case, consider discussing the issue with your advisors, seniors, or peers. 


Two final remarks We have two final remarks to make. Firstly, we advise you not 
to give up during the research process. Research is a complex task and it is natural to 
feel discouraged or overwhelmed at times. However, persistence and patience will 
ultimately pay off. Secondly, communication skills matter: the guidelines above rely on 
quickly judging the writing quality and grasping the main idea of a paper. We 
strongly encourage you to work on improving your reading comprehension and 
grammar skills. 